Part I: Classification
1. Data types
1) Built-in Python types
list
tuple
dict
set
my_set = {'123','456',89,True}
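A quick sketch of the four built-in container types (the values are illustrative):

```python
# The four built-in container types and their key properties
my_list = [1, 2, 2, 'a']           # ordered, mutable, allows duplicates
my_tuple = (1, 2, 'a')             # ordered, immutable
my_dict = {'a': 1, 'b': 2}         # key -> value mapping
my_set = {'123', '456', 89, True}  # unordered, elements are unique

print(len(my_list))   # duplicates are kept in a list
print(len(my_set))    # duplicates would be removed in a set
```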
2) ndarray: NumPy's multidimensional array
An ndarray stores elements of a single dtype, uses memory more efficiently than a list, and supports vectorized operations.
import numpy as np
x = np.loadtxt('../xxx.csv', delimiter=',')
y = np.array([1, 2, 3])
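A minimal sketch of the vectorized behavior described above (array values are illustrative):

```python
import numpy as np

y = np.array([1, 2, 3])

# Vectorized (element-wise) arithmetic: no explicit Python loop needed
print(y * 2)      # [2 4 6]
print(y + y)      # [2 4 6]
print(y.dtype)    # every element shares one dtype

# A plain list behaves differently: * 2 concatenates instead of multiplying
print([1, 2, 3] * 2)               # [1, 2, 3, 1, 2, 3]
print([v * 2 for v in [1, 2, 3]])  # [2, 4, 6] -- needs a comprehension
```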
3) DataFrame: pandas' two-dimensional tabular structure
(see the referenced blog post for other ways to construct one)
import pandas as pd
df_train = pd.read_csv('../train.csv', header=None, names=['sepal', 'sepal_wid', 'target'])
Access the underlying data via .values, which returns an ndarray:
df_train.values
array([[2000, 'Ohio', 1.5, 0],
       [2001, 'Ohio', 1.7, 1],
       [2002, 'Ohio', 3.6, 2],
       [2001, 'Nevada', 2.4, 3],
       [2002, 'Nevada', 2.9, 4]], dtype=object)
pd.Series() creates a one-dimensional labeled array.
Series.value_counts() counts the occurrences of each unique value.
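A short sketch of Series and value_counts (the data is illustrative):

```python
import pandas as pd

# A Series is a one-dimensional labeled array
s = pd.Series(['a', 'b', 'a', 'c', 'a'])

# value_counts() tallies each unique value, most frequent first
counts = s.value_counts()
print(counts)  # 'a' appears 3 times, 'b' and 'c' once each
```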
2. Classifiers
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
lr.score(X_test, y_test)
from sklearn.linear_model import SGDClassifier
sgdc = SGDClassifier()
from sklearn.svm import LinearSVC
lsvc = LinearSVC()
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
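All of the classifiers above share the same fit / predict / score interface, so they can be trained and compared in a single loop. A sketch using the iris dataset (the dataset and the subset of classifiers are illustrative choices, not from the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Every sklearn classifier exposes fit / predict / score,
# so swapping models is just swapping the estimator object.
scores = {}
for clf in (LogisticRegression(max_iter=1000),
            DecisionTreeClassifier(),
            KNeighborsClassifier()):
    clf.fit(X_train, y_train)
    scores[type(clf).__name__] = clf.score(X_test, y_test)

print(scores)
```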
3. Data standardization
Fit on the training data to compute the overall statistics (mean, variance, min/max, etc.), then apply those same statistics to the remaining data with transform(testData), so that train and test are processed identically.
For regression, since the predicted values are also continuous, the labels need to be standardized as well.
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
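A sketch of standardizing regression labels with a separate scaler, and mapping scaled predictions back with inverse_transform (the tiny arrays are illustrative; the name ss_y matches the variable used in the regression metrics at the end of these notes):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Separate scalers for features and (continuous) regression targets
ss_X = StandardScaler()
ss_y = StandardScaler()

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([[10.0], [20.0], [30.0], [40.0]])  # 2-D: StandardScaler expects columns

X_train_s = ss_X.fit_transform(X_train)
y_train_s = ss_y.fit_transform(y_train)

# Predictions made in the scaled space are mapped back to original units
print(ss_y.inverse_transform(y_train_s))  # recovers the original targets
```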
文本特征提取
CountVectorizer: considers only how often each word appears in a document.
TfidfVectorizer: besides a word's frequency within a document, it also accounts for how many documents contain that word, which discounts frequent but uninformative words and surfaces more meaningful features.
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
vec = CountVectorizer()
tfvec = TfidfVectorizer()
X_train = vec.fit_transform(X_train)
X_test = vec.transform(X_test)
Dict feature transformation (DictVectorizer)
Each distinct value of a categorical feature produces its own (one-hot) feature column.
https://blog.csdn.net/Jon_Sheng/article/details/79693971
from sklearn.feature_extraction import DictVectorizer
dict_vec = DictVectorizer(sparse=False)
X_train = dict_vec.fit_transform(X_train.to_dict(orient='records'))
X_test = dict_vec.transform(X_test.to_dict(orient='records'))
DataFrame.to_dict(orient='records')
https://blog.csdn.net/weixin_30577801/article/details/101335405
4. Data splitting
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=33)
5. Classification report
Reports precision, recall, F1-score, etc.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict, target_names=digits.target_names.astype(str)))
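A self-contained sketch with toy labels (all values here are hypothetical):

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels, just for illustration
y_test = [0, 1, 1, 2, 2, 2]
y_predict = [0, 1, 2, 2, 2, 1]

# One row per class: precision, recall, f1-score and support
report = classification_report(y_test, y_predict,
                               target_names=['class_0', 'class_1', 'class_2'])
print(report)
```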
6. Plotting
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [1, 2, 20, 50]
plt.scatter(x, y, s=20, c='b', marker='x')
plt.show()
Part II: Regression
1. Regressors
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
from sklearn.linear_model import SGDRegressor
sgdr = SGDRegressor()
from sklearn.svm import SVR
linear_svr = SVR(kernel='linear')
poly_svr = SVR(kernel='poly')
rbf_svr = SVR(kernel='rbf')
from sklearn.neighbors import KNeighborsRegressor
uni_knr = KNeighborsRegressor(weights='uniform')
dis_knr = KNeighborsRegressor(weights='distance')
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
rfr = RandomForestRegressor()
etr = ExtraTreesRegressor()
gbr = GradientBoostingRegressor()
2. Evaluation metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
print('The value of R-squared of LinearRegression is', r2_score(y_test, lr_y_predict))
print('The mean squared error of LinearRegression is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(lr_y_predict)))
print('The mean absolute error of LinearRegression is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(lr_y_predict)))
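A self-contained sketch of the three metrics on hypothetical values:

```python
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Hypothetical true and predicted values, just to show the three metrics
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot
mse = mean_squared_error(y_true, y_pred)    # mean of squared errors: 0.375
mae = mean_absolute_error(y_true, y_pred)   # mean of absolute errors: 0.5

print(r2, mse, mae)
```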