Update a pandas DataFrame column based on a date column, using a list of datetimes

2024-04-24

Original question: https://stackoverflow.com/q/73168797/15975987

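Please refer to the question above for details. I need to add 0.5 business days to the business_days column for each holiday that is in the second list but not in the first. Here is a sample input, a DataFrame named predicted_df: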

PredictionTargetDateEOM business_days
0       2022-06-30      22
1       2022-06-30      22
2       2022-06-30      22
3       2022-06-30      22
4       2022-06-30      22
        ... ... ...
172422  2022-11-30      21
172423  2022-11-30      21
172424  2022-11-30      21
172425  2022-11-30      21
172426  2022-11-30      21

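PredictionTargetDateEOM is simply the last day of the month. business_days is the number of business days in that month and should be the same for every row in that month. There are two holiday lists here. For each holiday that appears in the second list but not in the first, the business_days column should get +0.5 in every row of the data frame that falls in that holiday's month.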

rocket_holiday = ["New Year's Day", "Martin Luther King Jr. Day", "Memorial Day", "Independence Day",
                 "Labor Day", "Thanksgiving", "Christmas Day"]
rocket_holiday_including_observed = rocket_holiday + [item + ' (Observed)' for item in rocket_holiday]
print(rocket_holiday_including_observed)
["New Year's Day",
 'Martin Luther King Jr. Day',
 'Memorial Day',
 'Independence Day',
 'Labor Day',
 'Thanksgiving',
 'Christmas Day',
 "New Year's Day (Observed)",
 'Martin Luther King Jr. Day (Observed)',
 'Memorial Day (Observed)',
 'Independence Day (Observed)',
 'Labor Day (Observed)',
 'Thanksgiving (Observed)',
 'Christmas Day (Observed)']
banker_hols = [i for i in holidays.US(years = 2022).values()]
print(banker_hols)
2022-01-01 New Year's Day
2022-01-17 Martin Luther King Jr. Day
2022-02-21 Washington's Birthday
2022-05-30 Memorial Day
2022-06-19 Juneteenth National Independence Day
2022-06-20 Juneteenth National Independence Day (Observed)
2022-07-04 Independence Day
2022-09-05 Labor Day
2022-10-10 Columbus Day
2022-11-11 Veterans Day
2022-11-24 Thanksgiving
2022-12-25 Christmas Day
2022-12-26 Christmas Day (Observed)

The second list is actually derived from a dictionary, like this:

import holidays
# Keys are dates and values are holiday names
for date, name in holidays.US(years=2022).items():
    print(date, name)

The raw dictionary looks like this:

{datetime.date(2022, 1, 1): "New Year's Day", datetime.date(2022, 1, 17): 'Martin Luther King Jr. Day', datetime.date(2022, 2, 21): "Washington's Birthday", datetime.date(2022, 5, 30): 'Memorial Day', datetime.date(2022, 6, 19): 'Juneteenth National Independence Day', datetime.date(2022, 6, 20): 'Juneteenth National Independence Day (Observed)', datetime.date(2022, 7, 4): 'Independence Day', datetime.date(2022, 9, 5): 'Labor Day', datetime.date(2022, 10, 10): 'Columbus Day', datetime.date(2022, 11, 11): 'Veterans Day', datetime.date(2022, 11, 24): 'Thanksgiving', datetime.date(2022, 12, 25): 'Christmas Day', datetime.date(2022, 12, 26): 'Christmas Day (Observed)'}

Here is a sample output showing the desired result:

PredictionTargetDateEOM business_days
0       2022-06-30      22.5
1       2022-06-30      22.5
2       2022-06-30      22.5
3       2022-06-30      22.5
4       2022-06-30      22.5
        ... ... ...
172422  2022-11-30      21.5
172423  2022-11-30      21.5
172424  2022-11-30      21.5
172425  2022-11-30      21.5
172426  2022-11-30      21.5

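As you can see, because Juneteenth and Veterans Day are in the second list but not in the first, I add 0.5 days to the business_days column for every row whose month is June or November. For other months, such as July or January, where the holidays are shared between the two lists, the business_days column should stay unchanged. Finally, the approach should also be robust enough to backfill historical data from previous years. I have tried the following, but it does not work as needed: it either drops entire months from the data frame or, for the months it keeps, fails to change business_days in the months where I need it changed.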

main_list = list(set(banker_hols) - set(rocket_holiday_including_observed))
print(main_list)

['Columbus Day',
 'Juneteenth National Independence Day',
 "Washington's Birthday",
 'Juneteenth National Independence Day (Observed)',
 'Veterans Day']

result = []
for key, value in holidays.US(years = 2022).items():
    if value in main_list:
        result.append(key)
print(result)

[datetime.date(2022, 2, 21),
 datetime.date(2022, 6, 19),
 datetime.date(2022, 6, 20),
 datetime.date(2022, 10, 10),
 datetime.date(2022, 11, 11)]

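So I have the months that need 0.5 business days added, but I don't know how to update the business_days column for all rows of the data frame that fall in those months.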

EDIT: The problem was solved here: Add quantity to pandas column if row condition is met https://stackoverflow.com/q/73197589/15975987

My answer incorporates the key .loc() functionality shown in the linked question:

#Identify holidays in banker list not in rocket list
banker_hols = [i for i in holidays.US(years = 2022).values()]
hol_diffs = list(set(banker_hols) - set(rocket_holiday_including_observed))

#Extract dates of those holidays
dates_of_hols = []
for key, value in holidays.US(years = 2022).items():
    if value in hol_diffs:
        dates_of_hols.append(key)

#Extract just the months of those holidays
months = []
for item in dates_of_hols:
    months.append(item.month)
months = list(set(months))

#Add 0.5 to business_days for those months
predicted_df.loc[predicted_df['PredictionTargetDateEOM'].dt.month.isin(months), 'business_days'] += 0.5
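
To make the masked update easier to try out, here is a minimal sketch on a toy frame. The four rows and their business_days values are hypothetical stand-ins for predicted_df, and the holiday dates are hard-coded from the printed result in the question rather than pulled from the holidays package. The column is stored as floats on the assumption that recent pandas versions may warn when 0.5 is written into an integer column.

```python
import datetime

import pandas as pd

# Toy stand-in for predicted_df (hypothetical rows and values).
predicted_df = pd.DataFrame({
    "PredictionTargetDateEOM": pd.to_datetime(
        ["2022-06-30", "2022-06-30", "2022-07-31", "2022-11-30"]
    ),
    "business_days": [22.0, 22.0, 21.0, 21.0],
})

# Dates of the banker-only holidays, copied from the result printed in the question.
dates_of_hols = [
    datetime.date(2022, 2, 21),
    datetime.date(2022, 6, 19),
    datetime.date(2022, 6, 20),
    datetime.date(2022, 10, 10),
    datetime.date(2022, 11, 11),
]

# Unique months that contain at least one such holiday.
months = sorted({d.month for d in dates_of_hols})

# The boolean mask selects the rows; += then updates only those rows.
mask = predicted_df["PredictionTargetDateEOM"].dt.month.isin(months)
predicted_df.loc[mask, "business_days"] += 0.5

print(predicted_df)
```

Because the mask compares only the month number, a holiday found for 2022 would also match the same month in other years. When backfilling several years, it may be safer to match on year and month together.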

We only need the dates of the relevant holidays:

relevant_holidays = {
    x: y for x, y in holidays.US(years=2022).items() 
    if y not in rocket_holiday_including_observed
}

We use some pandas magic to get the corresponding month-end dates:

holiday_month_end = pd.to_datetime(
    list(relevant_holidays.keys())
).to_period("M").to_timestamp("M")
DatetimeIndex(['2022-02-28', '2022-06-30', '2022-06-30', '2022-10-31',
               '2022-11-30'],
              dtype='datetime64[ns]', freq=None)

Before joining, we count the holidays in each month and multiply by 0.5:

to_add = holiday_month_end.value_counts() * 0.5
2022-06-30    1.0
2022-02-28    0.5
2022-10-31    0.5
2022-11-30    0.5
dtype: float64
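
The two steps so far can be reproduced without the holidays package, as a rough sketch. The dictionary below is hard-coded from the 2022 values listed in the question, which is an assumption made only to keep the example self-contained.

```python
import datetime

import pandas as pd

# Banker-only holidays, copied from the 2022 values shown in the question.
relevant_holidays = {
    datetime.date(2022, 2, 21): "Washington's Birthday",
    datetime.date(2022, 6, 19): "Juneteenth National Independence Day",
    datetime.date(2022, 6, 20): "Juneteenth National Independence Day (Observed)",
    datetime.date(2022, 10, 10): "Columbus Day",
    datetime.date(2022, 11, 11): "Veterans Day",
}

# Snap each holiday date to the last day of its month.
holiday_month_end = (
    pd.to_datetime(list(relevant_holidays))
    .to_period("M")
    .to_timestamp("M")
)

# One entry per month end; every holiday in that month contributes 0.5 days.
to_add = holiday_month_end.value_counts() * 0.5

print(to_add.sort_index())
```

June gets 1.0 because both Juneteenth and its observed date fall in that month. The .loc-based answer deduplicates months and adds 0.5 once, so the two approaches differ for June.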

The index is now unique. To align it with the data frame, use reindex:

predicted_df["business_days"] = predicted_df["business_days"] + to_add.reindex(
    pd.to_datetime(predicted_df["PredictionTargetDateEOM"])
).fillna(0).values

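The fillna is necessary because to_add does not have an entry for every month. The .values is necessary to drop the index; otherwise + would try to match index values instead of preserving row order.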
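
The alignment step can be checked on a small example. This is a sketch under stated assumptions: the three-row frame is a hypothetical stand-in for predicted_df, and to_add is hard-coded from the output shown above. The Series.map variant at the end is not part of the answer. It is a swapped-in alternative that keeps the frame's own index, so no .values is needed.

```python
import pandas as pd

# Toy stand-in for predicted_df (hypothetical rows and values).
predicted_df = pd.DataFrame({
    "PredictionTargetDateEOM": pd.to_datetime(
        ["2022-06-30", "2022-07-31", "2022-11-30"]
    ),
    "business_days": [22.0, 21.0, 21.0],
})

# Per-month adjustment indexed by month-end date (values from the output above).
to_add = pd.Series(
    [1.0, 0.5, 0.5, 0.5],
    index=pd.to_datetime(["2022-06-30", "2022-02-28", "2022-10-31", "2022-11-30"]),
)

# reindex looks up each row's month end; months without an entry become NaN.
aligned = to_add.reindex(predicted_df["PredictionTargetDateEOM"])

# fillna(0) turns the missing months into "add nothing"; .values drops the
# date index so the addition is positional and keeps the row order.
via_reindex = predicted_df["business_days"] + aligned.fillna(0).values

# Alternative: map each date through to_add; the result keeps predicted_df's index.
via_map = predicted_df["business_days"] + (
    predicted_df["PredictionTargetDateEOM"].map(to_add).fillna(0)
)

print(via_reindex.tolist())
print(via_map.tolist())
```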
