Problem: group together events that are close to each other in time and also share the same variable. For example, given disease onset dates and addresses, find disease outbreaks that occurred at the same location within a specified time span. Large pandas DataFrame, ~300K rows. Sample data:
import pandas as pd

df = pd.DataFrame(
[
['2020-01-01 10:00', '1', 'A'],
['2020-01-01 10:01', '2', 'A'],
['2020-01-01 10:02', '3a', 'A'],
['2020-01-01 10:02', '3b', 'A'],
['2020-01-02 10:03', '4', 'B'],
['2020-01-02 10:50', '5', 'B'],
['2020-01-02 10:54', '6', 'B'],
['2020-01-02 10:55', '7', 'B'],
], columns=['event_time', 'event_id', 'Address']
)
The output should contain rows with the first and last event dates, the list of events, and the address:
event_time_start event_time_end events_and_related_event_id_list Address
0 2020-01-01 10:00:00 2020-01-01 10:02:00 [1, 2, 3a] A
6 2020-01-02 10:54:00 2020-01-02 10:55:00 [6, 7] B
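Before the full solution below, a minimal sketch of the core idea, assuming events only need to be close to their immediate predecessor: sort by Address and event_time, flag every row where the address changes or the gap to the previous event exceeds a threshold, and take the cumulative sum of that flag as a cluster id. The 3-minute threshold is an assumption for illustration; tune it to your definition of "close". (Note this grouping keeps both of the simultaneous events 3a and 3b.)

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1', 'A'],
        ['2020-01-01 10:01', '2', 'A'],
        ['2020-01-01 10:02', '3a', 'A'],
        ['2020-01-01 10:02', '3b', 'A'],
        ['2020-01-02 10:03', '4', 'B'],
        ['2020-01-02 10:50', '5', 'B'],
        ['2020-01-02 10:54', '6', 'B'],
        ['2020-01-02 10:55', '7', 'B'],
    ], columns=['event_time', 'event_id', 'Address']
)
df['event_time'] = pd.to_datetime(df['event_time'])
df = df.sort_values(['Address', 'event_time'])

gap = pd.Timedelta('3min')  # assumed threshold for "close in time"
# start a new cluster whenever the address changes or the time gap is too large
new_cluster = (df['Address'].ne(df['Address'].shift())
               | df['event_time'].diff().gt(gap))
df['cluster'] = new_cluster.cumsum()

out = (df.groupby(['cluster', 'Address'], as_index=False)
         .agg(event_time_start=('event_time', 'min'),
              event_time_end=('event_time', 'max'),
              events=('event_id', list)))
out = out[out['events'].str.len() > 1]  # keep only rows that actually cluster
```

This is simpler than the solution below but only chains each event to the one directly before it; the full solution handles non-sequential events via an explicit windowed lookup per event.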
Edit: clarification and solution
The solution is based on jezrael's approach for matching dates within a specified number of days before or after a given date, but with a groupby that includes Address. The first step worked perfectly on the real data without modification. Nothing below is changed except that some values are named for clarity.
The second step did not work because, unlike the sample data, the real data contains non-contiguous and non-sequential events. This required: sorting the first step's output by Address and event_time; different logic for the boolean Series that groups event_times together (m/timeGroup_bool); and removing the boolean Series as a filter on the df passed to GroupBy.agg.
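The grouping in that second step relies on the cumulative sum of a boolean Series: each True (an event too far from its predecessor) starts a new group number. A minimal standalone illustration of that trick, using made-up times:

```python
import pandas as pd

times = pd.Series(pd.to_datetime(
    ['2020-01-01 10:00', '2020-01-01 10:01',
     '2020-01-01 10:07', '2020-01-01 10:08']))
gap = pd.Timedelta('2min')

# True where an event is NOT within +/- gap of the previous event;
# the first row compares against NaT, which yields False, so ~False = True
new_group = ~times.between(times.shift(1) - gap, times.shift(1) + gap)
print(new_group.cumsum().tolist())  # [1, 1, 2, 2]
```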
Here is the complete solution, based on jezrael's elegant answer (the f1 lambda, which collects all values from the grouped lists, is best explained here: https://stackoverflow.com/questions/17657720/python-list-comprehension-double-for):
import pandas as pd

df = pd.DataFrame(
[
['1', 'A', '2020-01-01 10:00'],
['2', 'B', '2020-01-01 10:01'],
['3', 'A', '2020-01-01 10:01'],
['4', 'C', '2020-01-01 10:02'],
['5', 'D', '2020-01-01 10:03'],
['6', 'A', '2020-01-01 10:03'],
['7', 'E', '2020-01-01 10:03'],
['8', 'A', '2020-01-01 10:07'],
['9', 'A', '2020-01-01 10:09'],
['10', 'A', '2020-01-01 10:11'],
['11', 'F', '2020-01-01 10:54'],
['12', 'G', '2020-01-01 10:55'],
['13', 'F', '2020-01-01 10:56'],
], columns=['id', 'Address', 'event_time']
)
df['event_time'] = pd.to_datetime(df['event_time'])
df = df.sort_values(by=["Address", "event_time"])
## group by address and surrounding time
timeDiff = pd.Timedelta("2m") # time span between related events
def idsNearDates(mDf):
    # for each event, list the ids of the other events at the same address
    # whose event_time falls within +/- timeDiff of it
    f = lambda idx, val: mDf.loc[mDf['event_time'].between(val - timeDiff, val + timeDiff),
                                 'id'].drop(idx).tolist()
    mDf['relatedIds'] = [f(idx, value) for idx, value in mDf['event_time'].items()]
    return mDf
# group_keys=False keeps the original index, so Address stays a plain column
# and the sort_values below works on recent pandas versions
df_1stStep = df.groupby('Address', group_keys=False).apply(idsNearDates).sort_values(by=["Address", 'event_time'])
## aggregate the initial output into a single row per group of related events
# mark where event times are too far apart
timeGroup_bool = ~(df_1stStep['event_time'].between(df_1stStep['event_time'].shift(1) - timeDiff,
df_1stStep['event_time'].shift(1) + timeDiff))
# create a single list from all grouped lists
f1 = lambda x: list(dict.fromkeys([value for idList in x for value in idList]))
df_2ndstep = (df_1stStep.groupby([(timeGroup_bool).cumsum(),'Address'])
.agg(Date_first=('event_time','min'),
Date_last=('event_time','max'),
Ids=('relatedIds',f1))
.droplevel(0)
.reset_index())
# get rid of rows with empty lists
df_2ndstep = df_2ndstep[df_2ndstep['Ids'].str.len() > 0]
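The f1 aggregator used above flattens each group's relatedIds lists and removes duplicates while preserving first-seen order (dict.fromkeys keeps insertion order in Python 3.7+). A quick standalone check with a made-up group of lists:

```python
# same lambda as in the solution above
f1 = lambda x: list(dict.fromkeys([value for idList in x for value in idList]))

# e.g. the relatedIds lists of one group, with overlapping ids
print(f1([['3'], ['1', '6'], ['3']]))  # ['3', '1', '6']
```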