I'd use sklearn.feature_extraction.text.TfidfVectorizer (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which is designed specifically for this kind of task:
Demo:
In [63]: df
Out[63]:
Phrase Sentiment
0 is it good movie positive
1 wooow is it very goode positive
2 bad movie negative
Solution:
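import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Drop English stop words and any term that occurs in more than half of the phrases
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
# Remove 'Phrase' from df and turn it into a dense TF-IDF matrix
X = vect.fit_transform(df.pop('Phrase')).toarray()
r = df[['Sentiment']].copy()
del df
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X, columns=vect.get_feature_names_out())
del X
del vect
r.join(df)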
Result:
In [31]: r.join(df)
Out[31]:
Sentiment bad good goode wooow
0 positive 0.0 1.0 0.000000 0.000000
1 positive 0.0 0.0 0.707107 0.707107
2 negative 1.0 0.0 0.000000 0.000000
UPDATE: memory-saving solution:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
X = vect.fit_transform(df.pop('Phrase')).toarray()
# Add one column per term directly to df instead of building a second DataFrame
for i, col in enumerate(vect.get_feature_names_out()):
    df[col] = X[:, i]
UPDATE2: the memory issue was eventually resolved in this related question: https://stackoverflow.com/questions/41916560/pandas-dataframe-memory-python