Use join (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) with get_dummies (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html), then groupby (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) and aggregate by max:
df = df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max()
print (df)
    FEATURE_A  FEATURE_B  FEATURE_C  FEATURE_D
ID
1           1          1          1          0
2           1          0          1          0
3           0          1          0          1
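For reference, here is a self-contained version of this approach. The sample frame below is an assumption chosen to reproduce the output above; it is not taken from the question:

```python
import pandas as pd

# Assumed sample data, constructed to match the output shown above
df = pd.DataFrame({'ID': [1, 1, 2, 1, 3, 1, 2, 1, 3],
                   'Feature': list('ABAABBCCD')})

# One-hot encode Feature, prefix the columns, then take the per-ID maximum,
# so each indicator is 1 if the ID has that feature at least once
out = (df[['ID']]
       .join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_'))
       .groupby('ID')
       .max())
print(out)
```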
Detail:
print (pd.get_dummies(df['Feature']))
   A  B  C  D
0  1  0  0  0
1  0  1  0  0
2  1  0  0  0
3  1  0  0  0
4  0  1  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  1  0
8  0  0  0  1
Another solution uses MultiLabelBinarizer (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) and the DataFrame constructor:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_],
                   index=df.ID).max(level=0)
print (df1)
    FEATURE_A  FEATURE_B  FEATURE_C  FEATURE_D
ID
1           1          1          1          0
2           1          0          1          0
3           0          1          0          1
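A runnable sketch of the same idea, with two caveats: the sample frame is again assumed, and `.max(level=0)` was later deprecated in pandas, so the equivalent `groupby(level=0).max()` is used here:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed sample data; values are single-letter strings, so
# MultiLabelBinarizer treats each one as a one-element set of labels
df = pd.DataFrame({'ID': [1, 1, 2, 1, 3, 1, 2, 1, 3],
                   'Feature': list('ABAABBCCD')})

mlb = MultiLabelBinarizer()
# Binarize, label the columns from the learned classes, index by ID,
# then collapse duplicate IDs with a per-group maximum
df1 = pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_],
                   index=df['ID']).groupby(level=0).max()
print(df1)
```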
Timings:
np.random.seed(123)
N = 100000
L = list('abcdefghijklmno'.upper())
df = pd.DataFrame({'Feature': np.random.choice(L, N),
                   'ID': np.random.randint(10000, size=N)})

def jez(df):
    mlb = MultiLabelBinarizer()
    return pd.DataFrame(mlb.fit_transform(df['Feature']),
                        columns=['FEATURE_' + x for x in mlb.classes_],
                        index=df.ID).max(level=0)
#jez1
In [464]: %timeit (df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max())
10 loops, best of 3: 39.3 ms per loop
In [465]: %timeit (jez(df))
10 loops, best of 3: 138 ms per loop
#Scott Boston1
In [466]: %timeit (df.set_index('ID')['Feature'].str.get_dummies().add_prefix('FEATURE_').max(level=0))
1 loop, best of 3: 1.03 s per loop
#wen1
In [467]: %timeit (pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE '))
1 loop, best of 3: 383 ms per loop
#wen2
In [468]: %timeit (pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0))
10 loops, best of 3: 47 ms per loop
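As a sanity check, the two fastest approaches can be verified to agree on a smaller synthetic frame (a sketch with assumed sizes; `sum(level=0)` is written as the equivalent `groupby(level=0).sum()`):

```python
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000
L = list('ABCDE')
df = pd.DataFrame({'Feature': np.random.choice(L, N),
                   'ID': np.random.randint(100, size=N)})

# jez1: one-hot encode, then per-ID maximum
a = (df[['ID']]
     .join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_'))
     .groupby('ID')
     .max())

# wen2: drop duplicate (Feature, ID) pairs first, so the per-ID sum
# of dummies is also a 0/1 presence indicator
b = (pd.get_dummies(df.drop_duplicates().set_index('ID')['Feature'])
     .add_prefix('FEATURE_')
     .groupby(level=0)
     .sum())

print(a.astype(int).equals(b.astype(int)))
```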
Caveat
These results do not account for the number of unique values in Feature and ID, which will affect timings a lot for some of these solutions.