Feature selection and prediction

2024-02-18

from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris

I have X and Y data.

data = load_iris()    
X = data.data
Y = data.target

I want to use k-fold cross-validation to perform RFECV feature selection and prediction.

Corrected code, following the answer by Vivek Kumar (https://stackoverflow.com/users/3374996/vivek-kumar):

clf = RandomForestClassifier()

kf = KFold(n_splits=2, shuffle=True, random_state=0)  

estimators = [('standardize' , StandardScaler()),
              ('clf', clf)]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_ 

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, cv=kf, scoring='accuracy', verbose=10)
rfecv_data = rfecv.fit(X, Y)

print ('no. of selected features =', rfecv_data.n_features_)

Edit (for the remaining part):

X_new = rfecv.transform(X)
print(X_new.shape)

y_predicts = cross_val_predict(clf, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print ('accuracy =', accuracy)

Don't wrap StandardScaler and RFECV in the same pipeline. Instead, put StandardScaler and RandomForestClassifier in a pipeline and pass that pipeline to RFECV as the estimator. That way no training information is leaked.

estimators = [('standardize' , StandardScaler()),
              ('clf', RandomForestClassifier())]

pipeline = Pipeline(estimators)


rfecv = RFECV(estimator=pipeline, scoring='accuracy')
rfecv_data = rfecv.fit(X, Y)

Update: Regarding the error 'RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes'

Yes, this is a known issue with scikit-learn pipelines. See my other answer here for more details: https://stackoverflow.com/a/51418655/3374996. You can use the new pipeline I created there.

Define a custom pipeline like this:

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

And use it:

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, scoring='accuracy')
rfecv_data = rfecv.fit(X, Y)
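
As a possible alternative to subclassing: newer scikit-learn releases (0.24 and later, as far as I know) give RFECV an importance_getter parameter that accepts a dotted attribute path. The sketch below assumes such a version is installed and that the last pipeline step is named 'clf' as above. The iris data, n_splits=5 and random_state=0 are only illustrative choices, not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; replace with your own X and Y.
X, Y = load_iris(return_X_y=True)

# A plain Pipeline, no custom subclass.
pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0)),
])

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Tell RFECV where to find the importances inside the fitted pipeline.
# Assumes scikit-learn >= 0.24, where importance_getter was introduced.
rfecv = RFECV(
    estimator=pipeline,
    cv=kf,
    scoring='accuracy',
    importance_getter='named_steps.clf.feature_importances_',
)
rfecv.fit(X, Y)

print('no. of selected features =', rfecv.n_features_)
```

On older versions the parameter does not exist, so the custom pipeline class is still the way to go there.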

Update 2:

@brute, with your data and code, the algorithm finishes within a minute on my PC. Here is the complete code I used:

import numpy as np
import glob
from sklearn.utils import resample
files = glob.glob('/home/Downloads/Untitled Folder/*') 
outs = [] 
for fi in files: 
    data = np.genfromtxt(fi, delimiter='|', dtype=float) 
    data = data[~np.isnan(data).any(axis=1)] 
    data = resample(data, replace=False, n_samples=1800, random_state=0) 
    outs.append(data) 

X = np.vstack(outs) 
print(X.shape)
Y = np.repeat([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1800)
print(Y.shape)

#from sklearn.utils import shuffle
#X, Y = shuffle(X, Y, random_state=0)

from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

clf = RandomForestClassifier()

kf = KFold(n_splits=10, shuffle=True, random_state=0)  

estimators = [('standardize' , StandardScaler()),
              ('clf', RandomForestClassifier())]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_ 

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, scoring='accuracy', verbose=10)
rfecv_data = rfecv.fit(X, Y)

print ('no. of selected features =', rfecv_data.n_features_)

Update 3: For cross_val_predict

X_new = rfecv.transform(X)
print(X_new.shape)

# Here change clf to pipeline, 
# because RFECV has found features according to scaled data,
# which is not present when you pass clf 
y_predicts = cross_val_predict(pipeline, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print ('accuracy =', accuracy)
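
For anyone who wants to try the whole flow without the files from Update 2, here is a minimal self-contained sketch on the iris data from the question, written for Python 3. It is an illustration under assumptions, not the exact code I ran: n_splits=5 and random_state=0 are arbitrary, and the properties read self.steps[-1][1] instead of the private _final_estimator attribute, which should be equivalent here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Mypipeline(Pipeline):
    """Pipeline that exposes the last step's coef_ / feature_importances_."""

    @property
    def coef_(self):
        return self.steps[-1][1].coef_

    @property
    def feature_importances_(self):
        return self.steps[-1][1].feature_importances_


X, Y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

pipeline = Mypipeline([
    ('standardize', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0)),
])

# Feature selection with the scaler + classifier pipeline as the estimator.
rfecv = RFECV(estimator=pipeline, cv=kf, scoring='accuracy')
rfecv.fit(X, Y)
print('no. of selected features =', rfecv.n_features_)

# Keep only the selected columns.
X_new = rfecv.transform(X)
print(X_new.shape)

# Predict with the pipeline (not the bare classifier), so each fold is
# scaled the same way as during feature selection.
y_predicts = cross_val_predict(pipeline, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print('accuracy =', accuracy)
```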

