Option 1
df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
bar 12
foo 16
super 10
dtype: int64
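To see why the dot product works, here is a minimal, self-contained sketch. The DataFrame below is hypothetical: the original df is not shown here, so the rows were picked only so that the totals match the output above.

```python
import pandas as pd

# Hypothetical data (an assumption, not the original df), chosen so the
# per-word totals come out as bar=12, foo=16, super=10.
df = pd.DataFrame({
    "query": ["foo bar", "foo super", "bar"],
    "count": [6, 10, 6],
})

# One indicator column per word: 1 if the word occurs in that row's query.
dummies = df["query"].str.get_dummies(sep=" ")

# Transposing puts words on the rows; the dot product with the counts then
# adds up the count of every row in which each word appears.
totals = dummies.T.dot(df["count"])
print(totals)
```

Because get_dummies only records whether a word is present, a word repeated inside one query is still counted once for that row.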
Option 2
df['query'].str.get_dummies(sep=' ').mul(df['count'], axis=0).sum()
bar 12
foo 16
super 10
dtype: int64
Option 3
numpy.bincount + pd.factorize
This one also highlights cytoolz.mapcat, which returns an iterator that maps a function over a sequence and concatenates the results. That's cool!
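import pandas as pd, numpy as np, cytoolz
q = df['query'].values
c = df['count'].values
# Split every query into words, flatten them, and encode each word as an integer code
f, u = pd.factorize(np.array(list(cytoolz.mapcat(str.split, q.tolist())), dtype=object))
# Words per query = number of spaces + 1 (np.char is the public home of defchararray.count)
l = np.char.count(q.astype(str), ' ') + 1
# Repeat each count once per word, then sum the repeated counts per word code
pd.Series(np.bincount(f, c.repeat(l)).astype(int), u)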
foo 16
bar 12
super 10
dtype: int64
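If cytoolz is not installed, the same idea can be sketched with the standard library: itertools.chain.from_iterable over the split queries plays the role of mapcat. This is a variant under assumptions, not the code above: the DataFrame is hypothetical, and the word counts come from the split lists rather than from counting spaces.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical data (an assumption), chosen to reproduce the totals above.
df = pd.DataFrame({
    "query": ["foo bar", "foo super", "bar"],
    "count": [6, 10, 6],
})

# Split each query, then flatten: the stdlib counterpart of mapcat(str.split, ...).
tokens = [q.split() for q in df["query"]]
words = np.array(list(chain.from_iterable(tokens)), dtype=object)

# Integer code per word, with unique words in order of first appearance.
codes, uniques = pd.factorize(words)

# Repeat each row's count once per word so the weights line up with the codes.
weights = np.repeat(df["count"].to_numpy(), [len(t) for t in tokens])

# bincount sums the weights that share a code, giving one total per word.
totals = pd.Series(np.bincount(codes, weights=weights).astype(int), index=uniques)
print(totals)
```

Unlike Option 1, this counts a word once per occurrence, so a word repeated within a single query contributes that row's count more than once.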
Option 4
An absurd use of things... just use Option 1.
pd.DataFrame(dict(
    query=' '.join(df['query']).split(),
    count=df['count'].repeat(df['query'].str.count(' ') + 1)
)).groupby('query')['count'].sum()
query
bar 12
foo 16
super 10
Name: count, dtype: int64