Machine Learning - Linear Regression - sklearn

2023-10-26

Linear Models

Linear Regression in Scikit-Learn

from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
# Create the dataset
X=2*np.random.rand(10,1)  # 10 rows, 1 column of uniform random values in [0, 2)
y=4+3*X+np.random.randn(10,1)  # y = 4 + 3x plus Gaussian noise
# Create and fit the model
lin_reg=LinearRegression()
lin_reg.fit(X,y)
plt.plot(X,y,"b.")
# Use the model to predict at x=0 and x=2
X_new=[[0],[2]]
y_predict=lin_reg.predict(X_new)
print(y_predict)
plt.plot(X_new,y_predict,"r-")
plt.axis([0,2,0,15])
plt.show()

[[4.43739203] [8.71719537]]

The LinearRegression class is based on the scipy.linalg.lstsq() function (least squares).

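import numpy as np
# Create the dataset
X=2*np.random.rand(10,1)  # 10 rows, 1 column of uniform random values in [0, 2)
y=4+3*X+np.random.randn(10,1)
# np.linalg.lstsq is NumPy's counterpart of scipy.linalg.lstsq.
# X has no bias column here, so this fits a line through the origin (slope only).
data= np.linalg.lstsq(X, y, rcond=1e-6)
print(data)
# The returned tuple is (theta_best_svd, residuals, rank, s)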

(array([[6.25848821]]), array([52.04262759]), 1, array([2.92786811]))
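To recover both the intercept and the slope with lstsq, the feature matrix needs a column of ones. The following is a minimal sketch under that assumption; the seeded generator, the sample size of 100 and the variable names are illustrative choices, not part of the original post. It checks that the least-squares solution agrees with what LinearRegression fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)           # seeded generator (illustrative choice)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))

# Prepend a column of ones so the first parameter acts as the intercept
X_b = np.c_[np.ones((100, 1)), X]
theta, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)

lin_reg = LinearRegression().fit(X, y)
print(theta.ravel())                      # [intercept, slope] from lstsq
print(lin_reg.intercept_, lin_reg.coef_)  # the same values from LinearRegression
```

Both print statements should report the same intercept and slope up to floating-point error.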

Computing the pseudoinverse (Moore-Penrose inverse)

X=2*np.random.rand(10,1)
print(np.linalg.pinv(X))

[[0.04394662 0.12665656 0.08815469 0.04822709 0.04879359 0.01847769 0.1103439 0.06393026 0.10650913 0.08576313]]
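The pseudoinverse can be used directly to solve the least-squares problem: multiplying it by y gives the parameter vector. The sketch below assumes a bias column is added first (the names X_b, theta_pinv and theta_lstsq are illustrative) and compares the result with np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(1)            # seeded generator (illustrative choice)
X = 2 * rng.random((10, 1))
y = 4 + 3 * X + rng.standard_normal((10, 1))
X_b = np.c_[np.ones((10, 1)), X]          # add the bias column

# theta = pinv(X_b) @ y is the least-squares solution
theta_pinv = np.linalg.pinv(X_b) @ y
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta_pinv.ravel())
print(theta_lstsq.ravel())
```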

Gradient Descent

Gradient descent with NumPy

import numpy as np
#data
X=np.random.randn(100,1)
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
print(X_b[0])
y=4+3*X+np.random.randn(100,1)
eta=0.1 #learning rate
n_iterations=1000
m=100
theta=np.random.randn(2,1) #random initialization
for iteration in range(n_iterations):
    gradients=2/m*X_b.T.dot(X_b.dot(theta)-y)
    theta=theta-eta*gradients
print(theta)

[1. 0.68156223] [[4.02101178] [2.871612 ]]
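Each pass through the loop above takes one batch gradient descent step on the MSE cost. Written out, with the symbols mirroring the variables in the code (X_b is the feature matrix including the bias column, m the number of instances, and \eta the learning rate eta), the step is:

```latex
\nabla_{\theta}\,\mathrm{MSE}(\theta) = \frac{2}{m}\, X_b^{\top} \left( X_b\,\theta - y \right),
\qquad
\theta \leftarrow \theta - \eta\, \nabla_{\theta}\,\mathrm{MSE}(\theta)
```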

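Stochastic Gradient Descent (SGD)

The main problem with batch gradient descent is that it uses the whole training set to compute
the gradient at every step, so it is very slow when the training set is large. At the opposite
extreme, stochastic gradient descent picks one random instance from the training set at each step
and computes the gradient from that single instance alone.
The randomness helps the algorithm escape local optima, but it also means the algorithm never
settles exactly at the minimum. A good way out of this dilemma is to gradually reduce the learning
rate: the steps start out large and then get smaller and smaller, letting the algorithm get as
close as possible to the global minimum. This process resembles simulated annealing.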

import numpy as np 
import matplotlib.pyplot as plt
np.random.seed(42)

theta_path_sgd = []  # record the intermediate values of theta
#data
X=np.random.randn(100,1)
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
y=4+3*X+np.random.randn(100,1)
n_epochs=50
m=len(X_b)
t0,t1=5,50 # learning schedule hyperparameters
def learning_schedule(t):
    return t0/(t+t1)
theta=np.random.randn(2,1) #random initialization
for epoch in range(n_epochs):
    for i in range(m):  # each epoch consists of m single-instance updates
        random_index=np.random.randint(m)
        xi=X_b[random_index:random_index+1]
        yi=y[random_index:random_index+1]
        gradients=2*xi.T.dot(xi.dot(theta)-yi)
        eta=learning_schedule(epoch*m+i)
        theta=theta-eta*gradients
        theta_path_sgd.append(theta)
print(f"theta={theta}")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("y", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])

# Plot the lines given by the first 50 SGD steps
x_draw=[[0],[2]]
for theta_i in theta_path_sgd[:50]:
    y_draw=[[theta_i[0][0]],[2*theta_i[1][0]+theta_i[0][0]]]
    plt.plot(x_draw,y_draw,"r-")
plt.plot(X,y,"b.")
plt.show()

theta=[[4.11199749] [2.82708341]]

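Stochastic gradient descent in Scikit-Learn

To perform linear regression with stochastic gradient descent in Scikit-Learn, you can use the
SGDRegressor class, which by default optimizes the squared error cost function.
The code below runs for at most 1000 epochs, or until the loss drops by less than 0.001 during
one epoch (max_iter=1000, tol=1e-3). It starts with a learning rate of 0.1 (eta0=0.1) and uses
the default learning schedule (different from the one used above). Finally, it does not use any
regularization (penalty=None, more on this later):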

from sklearn.linear_model import SGDRegressor
import numpy as np 
import matplotlib.pyplot as plt
X=np.random.randn(100,1)
y=4+3*X+np.random.randn(100,1)
sgd_reg=SGDRegressor(max_iter=1000,tol=1e-3,penalty=None,eta0=0.1)
sgd_reg.fit(X,y.ravel())  # .ravel() flattens y into the 1D array that fit() expects
print(sgd_reg.intercept_)  # intercept
print(sgd_reg.coef_)  # coefficient (slope)
plt.plot(X,y,"b.")
x_draw=[[0],[2]]
y_draw=[[sgd_reg.intercept_[0]],[2*sgd_reg.coef_[0]+sgd_reg.intercept_[0]]]
plt.plot(x_draw,y_draw,"r-")
#plt.axis([0, 2, 0, 15])  # axis display range
plt.show()

[4.18111695] [3.17350694]


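Mini-batch gradient descent

Batch gradient descent computes the gradient from the full training set, and stochastic gradient
descent computes it from one randomly chosen instance. Mini-batch gradient descent sits in
between: at each step it computes the gradient from a small random subset of instances.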
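A minimal NumPy sketch of mini-batch gradient descent, written in the same style as the batch and stochastic versions above. The batch size of 20, the constant learning rate of 0.1 and the choice to reshuffle the data once per epoch and walk through it in slices are assumptions made for illustration; sampling each mini-batch independently at random is an equally valid variant.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = rng.standard_normal((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance

n_epochs = 50
batch_size = 20                           # assumed mini-batch size
eta = 0.1                                 # assumed constant learning rate
theta = rng.standard_normal((2, 1))       # random initialization

for epoch in range(n_epochs):
    shuffled = rng.permutation(m)         # new random order every epoch
    for start in range(0, m, batch_size):
        idx = shuffled[start:start + batch_size]
        xi, yi = X_b[idx], y[idx]
        # Same gradient formula as batch GD, averaged over the mini-batch only
        gradients = 2 / len(idx) * xi.T @ (xi @ theta - yi)
        theta = theta - eta * gradients

print(theta.ravel())                      # should be close to [4, 3]
```

Because the learning rate is constant, theta keeps fluctuating slightly around the least-squares solution; a decaying schedule like learning_schedule in the SGD example would reduce that.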

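Polynomial Regression

A linear model can be used to fit nonlinear data: add powers of each feature as new features, then train a linear model on this extended feature set. This technique is called polynomial regression.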

import numpy as np 
import matplotlib.pyplot as plt
np.random.seed(42)

#data
m=100
X=6*np.random.rand(m,1)-3
y=0.7*X**2+X+3.6+np.random.randn(m,1)
plt.plot(X,y,"b.")  # plot the generated nonlinear, noisy dataset
plt.show()

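When there are multiple features, polynomial regression can find relationships between features (which a plain linear regression model cannot do). This is possible because PolynomialFeatures also adds all combinations of features up to the given degree. For example, with two features a and b, PolynomialFeatures with degree=3 adds not only the features a^2, a^3, b^2 and b^3 but also the combinations ab, a^2b and ab^2.
PolynomialFeatures(degree=d) transforms an array containing n features into an array containing (n+d)!/(d! n!) features (this count includes the constant bias feature).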

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_features=PolynomialFeatures(degree=2,include_bias=False)
X_poly=poly_features.fit_transform(X)  # columns are x and x**2
# Fit a linear model on the polynomial features
lin_reg=LinearRegression()
lin_reg.fit(X_poly,y)
print(lin_reg.intercept_,lin_reg.coef_)
# coef_[0][0] multiplies x and coef_[0][1] multiplies x**2. For example, an output of
# [2.14407795] [[0.8758063  0.48173944]] means y=0.48*x**2+0.88*x+2.14

# Plot the fitted curve over the range of the data
x_draw=np.linspace(-3,3,100).reshape(100,1)
y_draw=lin_reg.predict(poly_features.transform(x_draw))
plt.plot(X,y,"b.")
plt.plot(x_draw,y_draw,"r-")
plt.show()

[3.38134581] [[0.93366893 0.76456263]]

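Learning Curves

If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, it is overfitting. If it performs poorly on both, it is underfitting.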
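One way to apply this rule of thumb is to compare the training error with the cross-validated error for models of different complexity. The sketch below is an illustration under assumed settings (quadratic data like the dataset in the previous section, polynomial degrees 1, 2 and 10, 5-fold cross-validation); the results dictionary and the other names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
m = 100
X = 6 * rng.random((m, 1)) - 3
y = (0.7 * X**2 + X + 3.6 + rng.standard_normal((m, 1))).ravel()

results = {}                               # degree -> (train RMSE, cross-validation RMSE)
for degree in (1, 2, 10):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # cross_val_score returns negated MSE values for this scoring name
    cv_mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    model.fit(X, y)
    train_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
    results[degree] = (train_rmse, np.sqrt(cv_mse.mean()))
    print(degree, results[degree])
```

A model whose two errors are both high (degree 1 here) is underfitting; a training error well below the cross-validated error points toward overfitting.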

Mean squared error: mean_squared_error
Train/test split: train_test_split

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X=np.random.randn(100,1)
y=4+3*X+np.random.randn(100,1)
# Plot learning curves: training and validation RMSE as the training set grows
def plot_learning_curves(model,X,y):
    X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.2)
    train_errors,val_errors=[],[]
    # Evaluate each fit on a held-out validation set
    for m in range(1,len(X_train)):  # start by fitting on a single instance
        model.fit(X_train[:m],y_train[:m])  # fit on the first m training instances
        y_train_predict=model.predict(X_train[:m])  # predictions for those m instances
        y_val_predict=model.predict(X_val)  # predictions for the validation set
        # MSE on the training subset
        train_errors.append(mean_squared_error(y_train[:m],y_train_predict))
        # MSE on the validation set
        val_errors.append(mean_squared_error(y_val,y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend(loc="upper right", fontsize=14)   # legend
    plt.xlabel("Training set size", fontsize=14) # x-axis label
    plt.ylabel("RMSE", fontsize=14)              # y-axis label
    plt.axis([0, 80, 0, 3])
    plt.show()
# Create a linear regression model
lin_reg=LinearRegression()
# Plot its learning curves
plot_learning_curves(lin_reg,X,y)

Learning curves of a 10th-degree polynomial model on the same data:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
# Define the pipeline
polynomial_regression=Pipeline([
    ("poly_features",PolynomialFeatures(degree=10,include_bias=False)),  # polynomial features
    ("lin_reg",LinearRegression())  # linear model
])
plot_learning_curves(polynomial_regression,X,y)


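There is a gap between the curves. This means the model performs much better on the training
data than on the validation data, which is the hallmark of an overfitting model. If you used a
much larger training set, however, the two curves would continue to get closer.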

Regularized Linear Models

Ridge regression: closed-form solution


from sklearn.linear_model import Ridge
# Create a Ridge regression instance
ridge_reg=Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X,y)
print(ridge_reg.predict([[1.5]]))

[[8.44384141]]
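The closed-form Ridge solution can also be computed by hand. The sketch below assumes the standard formula theta = (X_b^T X_b + alpha*A)^(-1) X_b^T y, where A is the identity matrix with a zero in the top-left corner so that the intercept is not regularized, and compares it with Ridge. The data generation and the names X_b, A and theta are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))

alpha = 1.0
X_b = np.c_[np.ones((len(X), 1)), X]      # add the bias column
A = np.eye(X_b.shape[1])
A[0, 0] = 0                               # do not penalize the intercept

# Solve (X_b^T X_b + alpha * A) theta = X_b^T y
theta = np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)

ridge_reg = Ridge(alpha=alpha, solver="cholesky").fit(X, y)
sk_theta = np.r_[np.ravel(ridge_reg.intercept_), np.ravel(ridge_reg.coef_)]
print(theta.ravel())
print(sk_theta)
```

Solving the linear system with np.linalg.solve avoids forming the matrix inverse explicitly.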

Ridge regression: stochastic gradient descent

from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])

array([8.46340189])

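The penalty hyperparameter sets the type of regularization term to use. Specifying "l2" tells SGD to add a regularization term to the cost function equal to half the square of the ℓ2 norm of the weight vector, which is Ridge regression.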

Lasso Regression

from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

array([8.35575446])
