Dataset splitting: sklearn.model_selection.train_test_split(*arrays, **options)
Main parameters:
*arrays: lists, numpy arrays, scipy sparse matrices, or pandas DataFrames
test_size: float, int, or None, default None
①if a float, the fraction of the samples to put in the test set
②if an int, the absolute number of test samples
③if None, the complement of train_size; when train_size is also None, it defaults to 0.25
train_size: float, int, or None, default None
①if a float, the fraction of the samples to put in the training set
②if an int, the absolute number of training samples
③if None, the complement of test_size (0.75 under the defaults)
random_state: int, RandomState instance, or None, default None
①if None, each call draws a fresh random split, so results can differ from run to run
②if an int, the seed fixes the split, so every run produces the same result
stratify: array-like or None
①if None, the split is purely random, so the class-label proportions in the resulting training and test sets may drift from those of the original data
②if not None, the split preserves the class-label proportions of the given array (typically y) in both the training and test sets, which is useful for imbalanced datasets
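The three forms of test_size can be checked directly. A minimal sketch, using a hypothetical 20-sample dataset (the names below are illustrative, not from the examples that follow):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)            # hypothetical data: 20 samples, indices 0..19
y = ['A'] * 12 + ['B'] * 8   # 20 matching labels

# Float: a fraction of the samples -> 0.25 * 20 = 5 test samples
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.25)
print(len(X_test_frac))      # 5

# Int: an absolute number of test samples
_, X_test_abs, _, _ = train_test_split(X, y, test_size=4)
print(len(X_test_abs))       # 4

# None (the default): falls back to 0.25 when train_size is also None
_, X_test_default, _, _ = train_test_split(X, y)
print(len(X_test_default))   # 5
```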
A few quick examples show what each parameter does:
①test_size sets the test/training split ratio
In [1]: import numpy as np
   ...: from sklearn.model_selection import train_test_split
   ...: X = np.arange(20)
   ...: y = ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A',
   ...:      'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A']
   ...: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [4]: X_test, y_test
Out[4]: (array([18,  1, 19,  8, 10]), ['A', 'B', 'A', 'B', 'A'])
②Different random_state values produce different splits
Running again with random_state=0 gives exactly the same result as above:
In [5]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
   ...: X_test, y_test
Out[5]: (array([18,  1, 19,  8, 10]), ['A', 'B', 'A', 'B', 'A'])
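The reproducibility shown above can be asserted in code. A small sketch with hypothetical data (not the y from the session above), checking that two calls with the same integer seed return identical partitions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)            # hypothetical data
y = ['A'] * 12 + ['B'] * 8

split_a = train_test_split(X, y, test_size=0.25, random_state=0)
split_b = train_test_split(X, y, test_size=0.25, random_state=0)

# Identical seed -> identical train/test partitions
assert np.array_equal(split_a[0], split_b[0])  # X_train matches
assert np.array_equal(split_a[1], split_b[1])  # X_test matches
assert split_a[3] == split_b[3]                # y_test matches
```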
With random_state=None, running the split twice produces two different results:
In [6]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=None)
   ...: X_test, y_test
Out[6]: (array([ 3, 18, 14,  7,  4]), ['A', 'A', 'A', 'B', 'A'])

In [7]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=None)
   ...: X_test, y_test
Out[7]: (array([18,  6,  3, 14,  8]), ['A', 'A', 'A', 'A', 'B'])
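As noted in the parameter list, random_state also accepts a RandomState instance. A sketch under the assumption that a freshly seeded instance behaves like the equivalent integer seed (hypothetical data again):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)            # hypothetical data
y = ['A'] * 12 + ['B'] * 8

# Two freshly seeded RandomState instances give the same split
rs1 = np.random.RandomState(42)
rs2 = np.random.RandomState(42)
split_a = train_test_split(X, y, test_size=0.25, random_state=rs1)
split_b = train_test_split(X, y, test_size=0.25, random_state=rs2)
assert np.array_equal(split_a[1], split_b[1])  # same X_test

# And match passing the integer seed directly
split_c = train_test_split(X, y, test_size=0.25, random_state=42)
assert np.array_equal(split_a[1], split_c[1])
```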
③Setting the stratify parameter handles class imbalance
In [8]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y)
   ...: X_test, y_test
Out[8]: (array([18,  8,  3, 10, 11]), ['A', 'B', 'A', 'A', 'B'])

In [9]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y)
   ...: X_test, y_test
Out[9]: (array([ 6, 19,  8, 17,  0]), ['A', 'A', 'B', 'B', 'A'])
In [10]: X_train, y_train
Out[10]: (array([ 7,  1, 11, 10, 15,  2,  3,  5,  4, 13, 12, 16, 18, 14,  9]),
 ['B', 'B', 'B', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'A'])
With stratify=y, every split keeps the class-label proportions of the test and training sets equal to those of the original sample: B:A = 2:3 in each.
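The preserved proportions can be counted explicitly. A sketch with a hypothetical imbalanced 12:8 label array, checking that both splits keep the exact 3:2 A:B ratio:

```python
from collections import Counter
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)            # hypothetical data
y = ['A'] * 12 + ['B'] * 8   # imbalanced labels, B:A = 2:3

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y)

# Both splits preserve the original 2:3 ratio exactly
print(Counter(y_train))  # Counter({'A': 9, 'B': 6})
print(Counter(y_test))   # Counter({'A': 3, 'B': 2})
```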