I am trying to speed up the sum of several large multilevel DataFrames.
Here is a sample:
df1 = mul_df(5000,30,400) # mul_df to create a big multilevel dataframe
#let df2, df3, df4 = df1, df1, df1 to minimize the memory usage,
#they can also be mul_df(5000,30,400)
df2, df3, df4 = df1, df1, df1
In [12]: timeit df1+df2+df3+df4
1 loops, best of 3: 993 ms per loop
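I'm not satisfied with 993 ms. Is there any way to speed this up? Can Cython improve the performance? If so, how should I write the Cython code? Thanks.
Note: mul_df() is the function that creates the demo multilevel DataFrame.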
import itertools
import numpy as np
import pandas as pd
def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''
    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum
    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt
    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst
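For readers who want to see the shape of the frame that mul_df() produces, here is a minimal sketch of an equivalent construction. It assumes a pandas version that provides pd.MultiIndex.from_product (newer than the 0.11.0 used in this post), and the name mul_df_product is hypothetical, used only for illustration.

```python
import numpy as np
import pandas as pd

def mul_df_product(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    """Hypothetical equivalent of mul_df(), built from a product index."""
    stk_ids = ['A' + str(x).zfill(4) for x in range(level1_rownum)]
    rpt_dates = ['B' + str(x).zfill(3) for x in range(level2_rownum)]
    # from_product varies the second level fastest, which matches the
    # row order that mul_df() builds by hand.
    index = pd.MultiIndex.from_product([stk_ids, rpt_dates],
                                       names=['STK_ID', 'RPT_Date'])
    columns = ['COL' + str(x).zfill(3) for x in range(col_num)]
    data = np.random.randn(level1_rownum * level2_rownum, col_num).astype(data_ty)
    return pd.DataFrame(data, index=index, columns=columns)

# Small demo: 4 first-level keys, 2 second-level keys, 6 columns.
demo = mul_df_product(4, 2, 6)
print(demo.shape)
print(demo.index[:3].tolist())
```

The demo frame has one row per (STK_ID, RPT_Date) pair, so mul_df(5000,30,400) corresponds to 150,000 rows by 400 float32 columns.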
Update:
Timings from my Pentium Dual-Core machine: 3.00 GB RAM, Windows XP, Python 2.7.4, NumPy 1.7.1, pandas 0.11.0, numexpr 2.0.1 (Anaconda 1.5.0, 32-bit)
In [1]: from pandas.core import expressions as expr
In [2]: import numexpr as ne
In [3]: df1 = mul_df(5000,30,400)
In [4]: df2, df3, df4 = df1, df1, df1
In [5]: expr.set_use_numexpr(False)
In [6]: %timeit df1+df2+df3+df4
1 loops, best of 3: 1.06 s per loop
In [7]: expr.set_use_numexpr(True)
In [8]: %timeit df1+df2+df3+df4
1 loops, best of 3: 986 ms per loop
In [9]: %timeit DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
1 loops, best of 3: 388 ms per loop
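Because df1 through df4 in the session above share one index and one set of columns, the label alignment that `+` performs is redundant work. Below is a minimal sketch, assuming every frame holds a single numeric dtype and is identically labeled, that adds the underlying NumPy arrays in place and wraps the result once. The helper name sum_aligned_frames is hypothetical, not a pandas API, and no timing is claimed for it here.

```python
import numpy as np
import pandas as pd

def sum_aligned_frames(frames):
    """Sum DataFrames that are assumed to share the same index and columns."""
    first = frames[0]
    for other in frames[1:]:
        # Refuse to proceed if labels differ; alignment is deliberately skipped.
        if not (other.index.equals(first.index)
                and other.columns.equals(first.columns)):
            raise ValueError("frames must share identical index and columns")
    # Copy once so the input frames are never modified.
    total = first.values.copy()
    for other in frames[1:]:
        # Accumulate in place to avoid allocating a temporary per addition.
        np.add(total, other.values, out=total)
    return pd.DataFrame(total, index=first.index, columns=first.columns)

# Small demo with a two-level index like the one mul_df() builds.
rng = np.random.RandomState(0)
idx = pd.MultiIndex.from_product([['A0000', 'A0001'], ['B000', 'B001', 'B002']],
                                 names=['STK_ID', 'RPT_Date'])
small = pd.DataFrame(rng.randn(6, 4).astype('float32'), index=idx,
                     columns=['COL000', 'COL001', 'COL002', 'COL003'])
before = small.copy()
result = sum_aligned_frames([small, small, small, small])
print(result.head())
```

The in-place accumulation is a design choice: a chained df1+df2+df3+df4 builds an intermediate frame for each `+`, while this sketch reuses one buffer. Whether that beats the ne.evaluate call above would need to be measured on the same machine.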