This code:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
X = 'some_data'    # placeholder; the real data is loaded at the bottom of the post
y = 'some_target'  # placeholder
penalty = 1.5e-5
A = Ridge(normalize=True, alpha=penalty).fit(X, y)
triggers the following warning:
FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * n_samples.
warnings.warn(
Ridge(alpha=1.5e-05)
But this code gives me completely different coefficients, which is expected, since normalization and standardization are different things.
B = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=penalty))
B[1].fit(B[0].fit_transform(X), y)
Output:
A.coef_[0], B[1].coef_[0]
(124.87330648168594, 125511.75051106009)
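I suspect part of the mismatch is just a scale factor: StandardScaler divides each column by its standard deviation, while the old normalize=True divided by the l2-norm of the centred column, which equals std * sqrt(n_samples). A toy check with made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
xc = x - x.mean()

std = x.std()                  # what StandardScaler divides by (population std)
l2 = np.sqrt((xc ** 2).sum())  # what normalize=True divided by

print(l2 / std)                # 2.0, i.e. sqrt(n_samples) for n = 4
```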
The results still do not match if I set alpha = penalty * n_features.
Output:
A.coef_[0], B[1].coef_[0]
(124.87330648168594, 114686.09835548172)
although the normalization that Ridge() uses is a bit different from what I expected. The docs say:
"The regressors X will be normalized by subtracting the mean and dividing by the l2-norm."
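If I read that docstring right, the old behaviour can be re-implemented by hand and matched with a pipeline, provided alpha is multiplied by n_samples (as the warning says) and the fitted coefficients are divided by scaler.scale_ afterwards. A sketch on synthetic data (the closed-form solve is my own reconstruction of the deprecated behaviour, not sklearn internals):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p)) * rng.uniform(1.0, 10.0, size=p)
y = X @ rng.normal(size=p) + rng.normal(size=n)
alpha = 1.5e-5

# Old `normalize=True` behaviour, done by hand: centre X, divide each
# column by its l2-norm, solve ridge in that space, then map the
# coefficients back to the original feature scale.
Xc = X - X.mean(axis=0)
l2 = np.sqrt((Xc ** 2).sum(axis=0))   # l2-norm of each centred column
Xn = Xc / l2
w = np.linalg.solve(Xn.T @ Xn + alpha * np.eye(p), Xn.T @ (y - y.mean()))
coef_old = w / l2                     # coefficients in original units

# Pipeline equivalent: StandardScaler divides by std = l2 / sqrt(n),
# so alpha must be multiplied by n_samples, and the fitted coefficients
# divided by scaler.scale_ to land back in original units.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha * n)).fit(X, y)
coef_new = pipe[1].coef_ / pipe[0].scale_

print(np.allclose(coef_old, coef_new))  # True
```

On this toy data the two coefficient vectors agree, which suggests the warning's alpha rescaling plus the extra division by scaler.scale_ is the missing piece.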
So what is the correct way to use ridge regression with normalization, given that the l2-norm can apparently only be obtained after a fit, a modification of the data, and a second fit?
What am I missing when using sklearn's ridge regression, especially with versions after 1.2?
Data for the experiments can be prepared from https://drive.google.com/file/d/1bu64NqQkG0YR8G2CQPkxR1EQUAJ8kCZ6/view :
import pandas as pd
url = 'https://drive.google.com/file/d/1bu64NqQkG0YR8G2CQPkxR1EQUAJ8kCZ6/view?usp=sharing'
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
data = pd.read_csv(url, index_col=0)
X = data.iloc[:,:15]
y = data['target']