Improve Query Performance From a Large HDFStore Table with Pandas

2023-12-27

I have a large (~160 million rows) dataframe that I've stored to disk with something like this:

    def fillStore(store, tablename):
        files = glob.glob('201312*.csv')
        names = ["ts", "c_id", "f_id","resp_id","resp_len", "s_id"]
        for f in files:
            df = pd.read_csv(f, parse_dates=True, index_col=0, names=names)
            store.append(tablename, df, format='table', data_columns=['c_id','f_id'])

The table has a time index, and I will query using c_id and f_id in addition to times (via the index).

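I have another dataframe containing ~18000 "incidents". Each incident consists of some (as few as hundreds, as many as hundreds of thousands) individual records. I need to collect some simple statistics for each incident and store them in order to compute some aggregate statistics. Currently I do it like this: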

def makeQueryString(c, f, start, stop):
    return "c_id == {} & f_id == {} & index >= Timestamp('{}') & index < Timestamp('{}')".format(c, f , str(pd.to_datetime(start)),str(pd.to_datetime(stop)))

def getIncidents(inc_times, store, tablename):
    incidents = pd.DataFrame(columns = ['c_id','f_id','resp_id','resp_len','s_id','incident_id'])
    for ind, row in inc_times.iterrows():
        incidents = incidents.append(store.select(tablename, 
                                                  makeQueryString(row.c_id, 
                                                                  row.f_id, 
                                                                  row.start, 
                                                                  row.stop))).fillna(ind)
    return incidents

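This all works fine, except that each store.select() statement takes roughly 5 seconds, which means that processing the full month's worth of data requires 24-30 hours. Meanwhile, the actual statistics I need are relatively simple: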

def getIncidentStats(df):
    incLen = (df.index[-1]-df.index[0]).total_seconds()
    if incLen == 0:
        incLen = .1
    rqsts = len(df)
    rqstRate_s = rqsts/incLen
    # column names follow the stored table (c_id, f_id, resp_len)
    return pd.Series({'c_id':df.c_id.iloc[0],
                      'f_id':df.f_id.iloc[0],
                      'Length_sec':incLen, 
                      'num_rqsts':rqsts, 
                      'rqst_rate':rqstRate_s, 
                      'avg_resp_size':df.resp_len.mean(), 
                      'std_resp_size':df.resp_len.std()})


incs = getIncidents(i_times, store, tablename)
inc_groups = incs.groupby('incident_id')
inc_stats = inc_groups.apply(getIncidentStats)

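My question is: how can I improve the performance or efficiency of any part of this workflow? (Please note that I actually batch most of the jobs, fetching and storing incidents one batch at a time, simply because I want to limit the risk of losing already processed data in the event of a crash. I left that code out here for simplicity and because I actually need to process the whole month's data.)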

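Is there a way to process the data as I receive it from the store, and is there any benefit to doing so? Would I benefit from using store.select_as_index? If I receive an index, I would still need to access the data to get the statistics, correct?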

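Other notes/questions: I have compared the performance of storing my HDFStore on an SSD versus a normal hard drive and didn't notice any improvement for the SSD. Is this expected?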

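I also toyed with the idea of joining many query strings into one large query and requesting them all at once. This causes memory errors when the total query string is too large (~5-10 queries).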

Edit 1: If it matters, I am using tables version 3.1.0 and pandas version 0.13.1.

Edit 2: Here is some more information:

ptdump -av store.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.0',
    TITLE := '',
    VERSION := '1.0']
/all_recs (Group) ''
  /all_recs._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['c_id', 'f_id'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {'index_name': 'ts'}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['c_id', 'f_id', 'resp_id', 'resp_len', 'dns_server_id'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0', 'c_id', 'f_id']]
/all_recs/table (Table(161738653,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(3,), dflt=0, pos=1),
  "c_id": Int64Col(shape=(), dflt=0, pos=2),
  "f_id": Int64Col(shape=(), dflt=0, pos=3)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "f_id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "c_id": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /all_recs/table._v_attrs (AttributeSet), 19 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0,
    FIELD_2_NAME := 'c_id',
    FIELD_3_FILL := 0,
    FIELD_3_NAME := 'f_id',
    NROWS := 161738653,
    TITLE := '',
    VERSION := '2.6',
    client_id_dtype := 'int64',
    client_id_kind := ['c_id'],
    fqdn_id_dtype := 'int64',
    fqdn_id_kind := ['f_id'],
    index_kind := 'datetime64',
    values_block_0_dtype := 'int64',
    values_block_0_kind := ['s_id', 'resp_len', 'resp_id']]

Here are samples of the main table and inc_times:

In [12]: df.head()
Out[12]: 
                          c_id        f_id          resp_id      resp_len  \
ts                                                                   
2013-12-04 08:00:00  637092486  5372764353               30      56767543   
2013-12-04 08:00:01  637092486  5399580619               23      61605423   
2013-12-04 08:00:04    5456242  5385485460               21      46742687   
2013-12-04 08:00:04    5456242  5385485460               21      49909681   
2013-12-04 08:00:04  624791800  5373236646               14      70461449   

                              s_id  
ts                           
2013-12-04 08:00:00           1829  
2013-12-04 08:00:01           1724  
2013-12-04 08:00:04           1679  
2013-12-04 08:00:04           1874  
2013-12-04 08:00:04           1727  

[5 rows x 5 columns]


In [13]: inc_times.head()
Out[13]: 
        c_id     f_id                start                 stop
0       7254   196211  1385880945000000000  1385880960000000000
1       9286   196211  1387259840000000000  1387259850000000000
2      16032   196211  1387743730000000000  1387743735000000000
3      19793   196211  1386208175000000000  1386208200000000000
4      19793   196211  1386211800000000000  1386211810000000000

[5 rows x 4 columns]

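Regarding c_id and f_id, the set of IDs I want to select from the full store is relatively small compared with the total number of IDs in the store. In other words, there are some popular IDs in inc_times that I will query repeatedly, while completely ignoring some of the IDs that exist in the full table. I'd estimate that the IDs I care about are roughly 10% of the total IDs, but these are the most popular IDs, so their records dominate the full set.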

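I have 16GB of RAM. The full store is 7.4GB and the full dataset (as csv files) is only 8.7GB. Initially I believed I would be able to load the whole thing into memory and at least do some limited operations on it, but I get memory errors when loading the whole thing. Hence the batching into daily files (the full file consists of data for one month).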

Here are some recommendations; a similar question is here: https://stackoverflow.com/questions/15798209/pandas-group-by-query-on-large-data-in-hdfstore

Use compression: see here: http://pandas-docs.github.io/pandas-docs-travis/io.html#compression. You should try this (it could make queries faster or slower depending on exactly what you are querying), YMMV.

ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5

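Use a hierarchical query in chunks. What I mean is this: since you have a relatively small number of c_id and f_id values that you care about, structure a single query something like this. This is kind of like using isin.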

f_id_list = []  # fill in: the f_ids that I care about
c_id_list = []  # fill in: the c_ids that I care about

def create_batches(l, maxn=32):
    """ create a list of batches, maxed at maxn """
    return [l[i:i + maxn] for i in range(0, len(l), maxn)]


def f(x):
    # you will need to filter out the min/max timestamps here (which I gather
    # are somewhat dependent on the f_id/c_id group)

    # process the data and return something;
    # for simple stats you could do something like this:
    return x.describe()


results = []
for f_id_batch in create_batches(f_id_list):

    for c_id_batch in create_batches(c_id_list):

        q = "f_id={f_id} & c_id={c_id}".format(
                f_id=f_id_batch,
                c_id=c_id_batch)

        # you can include the max/min times in here as well (they would be the
        # max/min time for ALL the included batches though, maybe easy for you
        # to compute)

        result = store.select('df',where=q)

        # sub process this result
        results.append(result.groupby(['f_id','c_id']).apply(f))

results = pd.concat(results)

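The key here is to batch so that the isin does NOT have more than 32 members for any variable that you are querying on. This is an internal numpy/pytables limitation. If you exceed it, the query will still work, but it will drop that variable and reindex ALL the data (which is NOT what you want here).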
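As the comment in the loop above suggests, the overall time window of a batch can be added to the query as well. The following is only a sketch of how such where-strings might be built: `batched` and `build_queries` are hypothetical helper names, `inc_times` is assumed to have the columns shown in the question (with start/stop as nanoseconds since the epoch), and the string syntax simply mirrors the `makeQueryString` function from the question.

```python
import pandas as pd

def batched(values, maxn=32):
    """Split a list into consecutive chunks of at most maxn items."""
    return [values[i:i + maxn] for i in range(0, len(values), maxn)]

def build_queries(inc_times, maxn=32):
    """Return one where-string per (f_id batch, c_id batch) pair that has incidents."""
    # tolist() yields plain Python ints, so the lists format cleanly into the string
    f_ids = sorted(inc_times['f_id'].unique().tolist())
    c_ids = sorted(inc_times['c_id'].unique().tolist())
    queries = []
    for f_batch in batched(f_ids, maxn):
        for c_batch in batched(c_ids, maxn):
            mask = inc_times['f_id'].isin(f_batch) & inc_times['c_id'].isin(c_batch)
            sub = inc_times[mask]
            if sub.empty:
                continue  # no incident uses this combination of ids
            # widest window covering every incident in this batch
            start = pd.to_datetime(sub['start'].min())
            stop = pd.to_datetime(sub['stop'].max())
            queries.append(
                "f_id={} & c_id={} & index >= Timestamp('{}') & index < Timestamp('{}')"
                .format(f_batch, c_batch, start, stop))
    return queries

# The five sample incidents from the question
inc_times = pd.DataFrame({
    'c_id': [7254, 9286, 16032, 19793, 19793],
    'f_id': [196211] * 5,
    'start': [1385880945000000000, 1387259840000000000, 1387743730000000000,
              1386208175000000000, 1386211800000000000],
    'stop': [1385880960000000000, 1387259850000000000, 1387743735000000000,
             1386208200000000000, 1386211810000000000]})
queries = build_queries(inc_times)
print(queries[0])
```

The time bounds only narrow the scan to the batch as a whole, so the rows returned still have to be filtered against each incident's own start and stop.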

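This way, after just a few loops, you will have a nice subset of the data in memory. I think each of these queries will take about as long as most of your current queries, but you will run far fewer of them.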

The query time is roughly constant for a given subset (unless the data is ordered such that it is completely indexed).

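So the query scans "blocks" of data (which is what the indexes point to). If you have lots of hits spread across many blocks, the query is slower.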
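To get a feel for why hits spread across many blocks are costly, here is a toy model. It is not how PyTables implements its indexes; it only counts how many fixed-size row chunks contain at least one matching row, for the same ids stored in random order and in sorted order. The function name `chunks_touched` and the sizes are made up for illustration.

```python
import numpy as np

def chunks_touched(ids, wanted, chunk_rows):
    """Count how many fixed-size row chunks contain at least one wanted id."""
    hit_rows = np.flatnonzero(np.isin(ids, wanted))
    return len(np.unique(hit_rows // chunk_rows))

rng = np.random.default_rng(0)
ids = rng.integers(0, 10, size=100_000)   # ten ids, uniformly mixed
chunk_rows = 1_000                        # 100 chunks in total

random_order = chunks_touched(ids, [3], chunk_rows)
sorted_order = chunks_touched(np.sort(ids), [3], chunk_rows)
print(random_order, sorted_order)
```

With the ids mixed at random, every chunk contains the wanted id and has to be read; once the rows are sorted by id, the matches sit in a small contiguous run of chunks.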

Here is a timing example:

In [5]: N = 100000000

In [6]: df = DataFrame(np.random.randn(N,3),columns=['A','B','C'])

In [7]: df['c_id'] = np.random.randint(0,10,size=N)

In [8]: df['f_id'] = np.random.randint(0,10,size=N)

In [9]: df.index = date_range('20130101',periods=N,freq='s')

In [10]: df.to_hdf('test2.h5','df',mode='w',data_columns=['c_id','f_id'])

In [11]: df.head()
Out[11]: 
                            A         B         C  c_id  f_id
2013-01-01 00:00:00  0.037287  1.153534  0.639669     8     7
2013-01-01 00:00:01  1.741046  0.459821  0.194282     8     3
2013-01-01 00:00:02 -2.273919 -0.141789  0.770567     1     1
2013-01-01 00:00:03  0.320879 -0.108426 -1.310302     8     6
2013-01-01 00:00:04 -1.445810 -0.777090 -0.148362     5     5
2013-01-01 00:00:05  1.608211  0.069196  0.025021     3     6
2013-01-01 00:00:06 -0.561690  0.613579  1.071438     8     2
2013-01-01 00:00:07  1.795043 -0.661966  1.210714     0     0
2013-01-01 00:00:08  0.176347 -0.461176  1.624514     3     6
2013-01-01 00:00:09 -1.084537  1.941610 -1.423559     9     1
2013-01-01 00:00:10 -0.101036  0.925010 -0.809951     0     9
2013-01-01 00:00:11 -1.185520  0.968519  2.871983     7     5
2013-01-01 00:00:12 -1.089267 -0.333969 -0.665014     3     6
2013-01-01 00:00:13  0.544427  0.130439  0.423749     5     7
2013-01-01 00:00:14  0.112216  0.404801 -0.061730     5     4
2013-01-01 00:00:15 -1.349838 -0.639435  0.993495     0     9


In [2]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1] & c_id=[2]")
1 loops, best of 3: 13.9 s per loop

In [3]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1,2] & c_id=[1,2]")
1 loops, best of 3: 21.2 s per loop

In [4]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1,2,3] & c_id=[1,2,3]")
1 loops, best of 3: 42.8 s per loop

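This particular example is 5GB uncompressed and 2.9GB compressed. These results are on the compressed data. In THIS case it is actually quite a bit faster to use the uncompressed file (e.g. the first loop took 3.5s). This is 100 million rows.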

So with the last example (4) you get 9x the data of the first in a little over 3x the query time.
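The arithmetic behind that statement, spelled out. It assumes the ids are uniformly distributed, as they are in the randomly generated example, so the number of rows returned is proportional to the number of (f_id, c_id) pairs selected.

```python
# Timings reported above for the compressed file (seconds per query)
t_one_pair = 13.9     # f_id=[1] & c_id=[2]
t_nine_pairs = 42.8   # f_id=[1,2,3] & c_id=[1,2,3]

pairs_small = 1 * 1   # one f_id times one c_id
pairs_large = 3 * 3   # three f_ids times three c_ids

data_ratio = pairs_large / pairs_small   # how much more data comes back
time_ratio = t_nine_pairs / t_one_pair   # how much longer the query takes
rows_per_second_gain = data_ratio / time_ratio

print(data_ratio, round(time_ratio, 2), round(rows_per_second_gain, 2))
```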

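However, your speedup should be MUCH greater, because you won't be selecting on individual timestamps in the query, but rather doing that later.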

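This whole approach assumes that you have enough main memory to hold the results at these batch sizes (e.g. you are selecting a relatively small part of the set in each batch query).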
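Once a batch is in memory, the per-incident time filtering and the statistics from the question can be done with plain pandas. The following is a minimal sketch under stated assumptions: `incident_stats` is a hypothetical helper, `result` is assumed to be what a batched query returned (a DatetimeIndex plus c_id, f_id and resp_len columns), and start/stop are assumed to be nanoseconds since the epoch, as in the inc_times sample.

```python
import pandas as pd

def incident_stats(result, inc_times):
    """Per-incident statistics computed from an in-memory subset of the table."""
    # Split once by id pair so each incident only scans its own records.
    groups = {(int(c), int(f)): grp.sort_index()
              for (c, f), grp in result.groupby(['c_id', 'f_id'])}
    rows = {}
    for inc in inc_times.itertuples():
        grp = groups.get((int(inc.c_id), int(inc.f_id)))
        if grp is None:
            continue  # no records for this id pair in the current batch
        start, stop = pd.to_datetime(inc.start), pd.to_datetime(inc.stop)
        recs = grp[(grp.index >= start) & (grp.index < stop)]
        if recs.empty:
            continue
        # Same convention as the question: avoid dividing by a zero-length window.
        length = (recs.index[-1] - recs.index[0]).total_seconds() or 0.1
        rows[inc.Index] = {'c_id': int(inc.c_id), 'f_id': int(inc.f_id),
                           'Length_sec': length, 'num_rqsts': len(recs),
                           'rqst_rate': len(recs) / length,
                           'avg_resp_size': recs.resp_len.mean(),
                           'std_resp_size': recs.resp_len.std()}
    return pd.DataFrame.from_dict(rows, orient='index')

# Tiny made-up batch: three records fall inside the incident window
result = pd.DataFrame(
    {'c_id': [7254] * 5, 'f_id': [196211] * 5, 'resp_len': [10, 20, 30, 40, 50]},
    index=pd.to_datetime(['2013-12-01 06:55:45', '2013-12-01 06:55:50',
                          '2013-12-01 06:55:55', '2013-12-01 06:56:00',
                          '2013-12-01 07:00:00']))
inc_times = pd.DataFrame({'c_id': [7254], 'f_id': [196211],
                          'start': [1385880945000000000],
                          'stop': [1385880960000000000]})
stats = incident_stats(result, inc_times)
print(stats)
```

The result is indexed by the incident's row label in inc_times, so the per-batch frames can be concatenated and aggregated afterwards.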
