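I face the common task of calculating age (in years, months, or weeks) given a date of birth and an arbitrary date. The catch is that I often have to do this over many records (> 300 million), so performance is a key issue here.
After a quick search on SO and Google, I found three alternatives: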
- A common arithmetic procedure (/365.25) (link)
- Using the functions `new_interval()` and `duration()` from the package `lubridate` (link)
- The function `age_calc()` from the package `eeptools` (link, link, link)
So, here's my toy code:
# Some toy birthdates
birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01",
                       "1962-12-30", "1962-12-31", "1963-01-01",
                       "2000-06-16", "2000-06-17", "2000-06-18",
                       "2007-03-18", "2007-03-19", "2007-03-20",
                       "1968-02-29", "1968-02-29", "1968-02-29"))
# Given dates to calculate the age
givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31",
                       "2015-12-31", "2015-12-31", "2015-12-31",
                       "2050-06-17", "2050-06-17", "2050-06-17",
                       "2008-03-19", "2008-03-19", "2008-03-19",
                       "2015-02-28", "2015-03-01", "2015-03-02"))
# Using a common arithmetic procedure ("Time differences in days"/365.25)
(givendate - birthdate) / 365.25
# Use the package lubridate
require(lubridate)
new_interval(start = birthdate, end = givendate) /
  duration(num = 1, units = "years")
# Use the package eeptools
library(eeptools)
age_calc(dob = birthdate, enddate = givendate, units = "years")
Let's discuss accuracy later and focus first on performance. Here's the code:
# Now let's compare the performance of the alternatives using microbenchmark
library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used below
mbm <- microbenchmark(
  arithmetic = (givendate - birthdate) / 365.25,
  lubridate = new_interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  eeptools = age_calc(dob = birthdate, enddate = givendate,
                      units = "years"),
  times = 1000
)
# And examine the results
mbm
autoplot(mbm)
And here are the results:
![Microbenchmark results - plot](https://i.stack.imgur.com/PYQ8g.png)
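Bottom line: the `lubridate` and `eeptools` functions perform much worse than the arithmetic method (/365.25 is at least 10 times faster). Unfortunately, the arithmetic method is not accurate enough, and I cannot afford the few mistakes it makes.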
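"because of the way the modern Gregorian calendar is constructed, there is no straightforward arithmetic method that produces a person's age, stated according to common usage, common usage meaning that a person's age should always be an integer that increases exactly on a birthday". (link)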
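To make the "common usage" rule in the quote concrete, here is a minimal base-R sketch: take the difference in calendar years, then subtract one wherever the birthday has not yet been reached in the end year. The name `age_years` is hypothetical (it does not come from any package), and the sketch assumes that someone born on February 29 turns a year older on March 1 in non-leap years. It only returns whole years and makes no claim about how `lubridate` or `eeptools` work internally.

```r
# Hypothetical helper (not from any package): whole-year age that
# increases exactly on the birthday, using only base R.
age_years <- function(dob, enddate) {
  dob <- as.POSIXlt(dob)
  enddate <- as.POSIXlt(enddate)
  # Difference in calendar years
  age <- enddate$year - dob$year
  # TRUE where the birthday has not yet occurred in the end year
  before_birthday <- (enddate$mon < dob$mon) |
    (enddate$mon == dob$mon & enddate$mday < dob$mday)
  # Assumption: Feb 29 birthdays count from Mar 1 in non-leap years
  age - as.integer(before_birthday)
}

ages <- age_years(
  as.Date(c("1978-12-31", "1979-01-01", "1968-02-29")),
  as.Date(c("2015-12-31", "2015-12-31", "2015-02-28"))
)
ages
# 37 36 46
```

The comparison is vectorized over both arguments, so it needs no per-record loop.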
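From what I read in some posts, `lubridate` and `eeptools` make no such mistakes (though I haven't looked at the code or read more about these functions to know which method they use). That is why I wanted to use them, but their performance does not work for my real application.
Any ideas on an efficient and accurate method to calculate age?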
EDIT
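Oops, it seems `lubridate` also makes mistakes. And apparently, based on this toy example, it makes more mistakes than the arithmetic method (see rows 3, 6, 9, 12). (Am I doing something wrong?)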
toy_df <- data.frame(
  birthdate = birthdate,
  givendate = givendate,
  arithmetic = as.numeric((givendate - birthdate) / 365.25),
  lubridate = new_interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  eeptools = age_calc(dob = birthdate, enddate = givendate,
                      units = "years")
)
toy_df[, 3:5] <- floor(toy_df[, 3:5])
toy_df
    birthdate  givendate arithmetic lubridate eeptools
1  1978-12-30 2015-12-31         37        37       37
2  1978-12-31 2015-12-31         36        37       37
3  1979-01-01 2015-12-31         36        37       36
4  1962-12-30 2015-12-31         53        53       53
5  1962-12-31 2015-12-31         52        53       53
6  1963-01-01 2015-12-31         52        53       52
7  2000-06-16 2050-06-17         50        50       50
8  2000-06-17 2050-06-17         49        50       50
9  2000-06-18 2050-06-17         49        50       49
10 2007-03-18 2008-03-19          1         1        1
11 2007-03-19 2008-03-19          1         1        1
12 2007-03-20 2008-03-19          0         1        0
13 1968-02-29 2015-02-28         46        47       46
14 1968-02-29 2015-03-01         47        47       47
15 1968-02-29 2015-03-02         47        47       47