最好的解决方案是引导,如评论中所述。我会用经典的iris数据,关注密度萼片长度对于每个Species.
生成样本
library(dplyr)
library(broom)
library(tidyr)
library(ggplot2)
data_frame(bs = 1:1000) %>% group_by(bs) %>%
mutate(data = list(iris %>% group_by(Species) %>%
sample_frac(size = 1, replace = T)))
# A tibble: 1,000 x 2
# Groups: bs [1,000]
bs data
<int> <list>
1 1 <tibble [150 x 5]>
2 2 <tibble [150 x 5]>
3 3 <tibble [150 x 5]>
4 4 <tibble [150 x 5]>
5 5 <tibble [150 x 5]>
6 6 <tibble [150 x 5]>
7 7 <tibble [150 x 5]>
8 8 <tibble [150 x 5]>
9 9 <tibble [150 x 5]>
10 10 <tibble [150 x 5]>
# ... with 990 more rows
因此,我只是对原始数据进行了 1000 次引导复制,从每组中取出与其原始样本大小相同的行数,并进行替换。现在我必须unnest
访问嵌套中的数据data column.
计算样本内密度
densities.within <-
data_frame(bs = 1:1000) %>% group_by(bs) %>%
mutate(data = list(iris %>% group_by(Species) %>%
sample_frac(size = 1, replace = T))) %>%
unnest() %>%
group_by(bs, Species) %>%
do(tidy(density(.$Sepal.Length,
from = min(iris$Sepal.Length),
to = max(iris$Sepal.Length),
n = 128)))
# A tibble: 384,000 x 4
# Groups: bs, Species [30,000]
bs Species x y
<int> <fctr> <dbl> <dbl>
1 1 setosa 4.300000 0.2395786
2 1 setosa 4.328346 0.2821128
3 1 setosa 4.356693 0.3235939
4 1 setosa 4.385039 0.3632449
5 1 setosa 4.413386 0.4010378
6 1 setosa 4.441732 0.4375189
7 1 setosa 4.470079 0.4734727
8 1 setosa 4.498425 0.5095333
9 1 setosa 4.526772 0.5459280
10 1 setosa 4.555118 0.5824587
# ... with 383,990 more rows
因此,我们将数据扩展为长形式,然后计算每个组的密度萼片长度 within Species within bs。我们必须提供手册from =
and to =
因为每个引导程序中的最小值和最大值可能不同(并设置较低的n =
比默认的 512 只是为了节省时间)。
为了简化S3: density
生成的对象,我们使用broom::tidy
。这是计算密集型步骤,因此我们将此对象保存为内密度.
将密度汇总为分位数
这会产生名为x and y,但我们将重命名它们以匹配我们的数据。然后我们会计算出: 对于每个计算出的可能的密度萼片长度,CI 的下限、中位数和 CI 的上限是多少?我们将使用quantile
以获得计算密度的这些特定值。
densities.qtiles <-
densities.within %>%
rename(Sepal.Length = x, dens = y) %>%
ungroup() %>%
group_by(Species, Sepal.Length) %>%
summarise(q05 = quantile(dens, 0.025),
q50 = quantile(dens, 0.5),
q95 = quantile(dens, 0.975))
# A tibble: 384 x 5
# Groups: Species [?]
Species Sepal.Length q05 q50 q95
<fctr> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.300000 0.05730022 0.2355335 0.4426299
2 setosa 4.328346 0.08177850 0.2734463 0.4970097
3 setosa 4.356693 0.09863062 0.3114570 0.5505578
4 setosa 4.385039 0.12459033 0.3430645 0.5884523
5 setosa 4.413386 0.15049699 0.3705389 0.6207344
6 setosa 4.441732 0.17494889 0.4006335 0.6418923
7 setosa 4.470079 0.19836510 0.4258006 0.6655006
8 setosa 4.498425 0.21106857 0.4555755 0.6971370
9 setosa 4.526772 0.23399070 0.4813130 0.7244413
10 setosa 4.555118 0.24863090 0.5108057 0.7708114
# ... with 374 more rows
可视化和比较
ggplot(densities.qtiles, aes(Sepal.Length, q50)) +
facet_wrap(~Species, nrow = 2) +
geom_histogram(data = iris,
aes(Sepal.Length, ..density..),
colour = "black", fill = "white",
binwidth = 0.25, boundary = 0) +
geom_ribbon(aes(ymin = q05, ymax = q95), alpha = 0.5, fill = "grey50") +
stat_density(data = iris,
aes(Sepal.Length, ..density.., color = "raw density"),
size = 2, geom = "line") +
geom_line(size = 1.5, aes(color = "bootstrapped")) +
scale_color_manual(values = c("red", "black")) +
labs(y = "density") +
theme(legend.position = c(0.5,0),
legend.justification = c(0,0))
我还包括原始数据的直方图和密度层以进行比较。您可以看到,1000 个引导样本的中值密度和原始密度非常接近。