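I have data in long format that will be grouped by geography. Within each group, I want to calculate the difference between one variable of interest and every other variable of interest. I could not work out how to do this efficiently in a single data.table statement, so I resorted to a workaround that introduced some new errors along the way (I fixed those with further workarounds, but would appreciate advice on them too!).
I then want to pass the resulting columns into a ggplot function, but I could not get the recommended approach to work, so I am using a deprecated one.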
library(data.table)
library(ggplot2)
set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
> results[c(1:4,13:16)]
geography variable statistic
1: 1 alpha -0.62645381
2: 1 bravo 0.18364332
3: 1 charlie -0.83562861
4: 1 delta 1.59528080
5: 4 alpha -0.62124058
6: 4 bravo -2.21469989
7: 4 charlie 1.12493092
8: 4 delta -0.04493361
base_variable <- "alpha"
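From this point, I would ideally like to write a simple piece of code that groups by geography and returns the table in the same format, but with the statistic for each variable in each group replaced by (base_variable - variable).
I don't know how to do this, so my workaround is below. Any suggestions for a better approach would be appreciated.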
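For reference, one way the grouped subtraction described above might be written as a single data.table statement is sketched below. It assumes each geography has exactly one row for base_variable; the names diffs and value are my own and purely illustrative.

```r
library(data.table)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable  = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"

# Within each geography, subtract every statistic from the base variable's statistic.
# Assumption: exactly one row per geography has variable == base_variable.
diffs <- results[, .(variable,
                     value = statistic[variable == base_variable] - statistic),
                 by = geography]

# Drop the base variable's own rows, which are all zero by construction.
diffs <- diffs[variable != base_variable]
```

The result stays in long format with columns geography, variable and value, although the rows come out ordered by geography rather than by variable.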
# Convert to a wide table so we can do the subtraction by rows
results_wide <- dcast(results, geography ~ variable, value.var = "statistic")
geography alpha bravo charlie delta
1: 1 -0.6264538 0.1836433 -0.8356286 1.59528080
2: 2 0.3295078 -0.8204684 0.4874291 0.73832471
3: 3 0.5757814 -0.3053884 1.5117812 0.38984324
4: 4 -0.6212406 -2.2146999 1.1249309 -0.04493361
this_is_a_hack <- as.data.table(lapply(results_wide[,-1], function(x) results_wide[, ..base_variable] - x))
alpha.alpha bravo.alpha charlie.alpha delta.alpha
1: 0 -0.8100971 0.2091748 -2.2217346
2: 0 1.1499762 -0.1579213 -0.4088169
3: 0 0.8811697 -0.9359998 0.1859381
4: 0 1.5934593 -1.7461715 -0.5763070
Now the names are mangled and the geography column is gone. Why are the names like this? The geography column also needs to be added back.
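My tentative understanding of the naming is that results_wide[, ..base_variable] returns a one-column data.table rather than a plain vector, so every element produced by lapply() is itself a table with a column called alpha, and as.data.table() then seems to join the list name and the inner column name (hence bravo.alpha). A sketch that sidesteps this by extracting the base column with [[ ]] and keeping geography in the same call is below; base, other_vars and diffs_wide are illustrative names.

```r
library(data.table)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable  = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"
results_wide <- dcast(results, geography ~ variable, value.var = "statistic")

# [[ ]] returns the base column as a plain numeric vector, not a one-column data.table.
base <- results_wide[[base_variable]]

# Every column except the grouping column and the base variable itself.
other_vars <- setdiff(names(results_wide), c("geography", base_variable))

# Keep geography and subtract each remaining column from the base vector.
diffs_wide <- results_wide[, c(list(geography = geography),
                               lapply(.SD, function(x) base - x)),
                           .SDcols = other_vars]
```

Because the subtraction now happens between plain vectors, the columns keep their original names and no regex clean-up should be needed.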
this_is_a_hack[, geography := results_wide[, geography] ]
normalise_these_names <- colnames(this_is_a_hack)
#Regex approach. Hacky and situational.
new_names <- sub("\\.(.*)", "", normalise_these_names[normalise_these_names != "geography"] )
normalise_these_names[normalise_these_names != "geography"] <- new_names
#Makes use of the fact that geographies will appear last in the data.table, not generalisable approach.
colnames(this_is_a_hack) <- normalise_these_names
I no longer need the base variable because all of its values are zero, so I tried to delete it, but I can't seem to do this the usual way:
this_is_a_hack[, ..base_variable := NULL]
Warning message:
In `[.data.table`(this_is_a_hack, , `:=`(..base_variable, NULL)) :
Column '..base_variable' does not exist to remove
# Workaround: drop the column with dplyr instead
library(dplyr)
this_is_a_hack <- select(this_is_a_hack, -all_of(base_variable))
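As far as I can tell, the `..` prefix is only meant for selecting columns in j, not for the left-hand side of `:=`. Wrapping the variable in parentheses tells data.table to evaluate it and use the resulting string as the column name. A minimal sketch on a made-up toy table (dt and its values are hypothetical):

```r
library(data.table)

# Toy table standing in for this_is_a_hack; the values are invented for illustration.
dt <- data.table(alpha = c(0, 0), bravo = c(1.5, -0.2), geography = 1:2)
base_variable <- "alpha"

# Parentheses make := treat base_variable as a variable holding the column name.
dt[, (base_variable) := NULL]
```

This removes the column by reference, so no reassignment and no dplyr dependency should be required.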
final_result <- melt(this_is_a_hack, id.vars = "geography")
> final_result[c(1:4,9:12)]
geography variable value
1: 1 bravo -0.8100971
2: 2 bravo 1.1499762
3: 3 bravo 0.8811697
4: 4 bravo 1.5934593
5: 1 delta -2.2217346
6: 2 delta -0.4088169
7: 3 delta 0.1859381
8: 4 delta -0.5763070
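The data is now ready to visualise. I am trying to pass these variables into a plotting function, but referencing data.table columns seems harder than with a data.frame. Apparently you are supposed to use quosures to pass data.table variables into a function, but that just errored, so I am using the deprecated 'aes_string' function instead. Help with this would also be appreciated.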
plott <- function(dataset, varx, vary, fillby) {
  # varx <- ensym(varx)
  # vary <- ensym(vary)
  # fillby <- ensym(fillby)
  ggplot(dataset,
         aes_string(x = varx, y = vary, color = fillby)) +
    geom_point()
}
plott(dataset = final_result,
varx = "geography",
vary = "value",
fillby = "variable")
# Error I get when I try the ensym(...) method in the function:
Don't know how to automatically pick scale for object of type name. Defaulting to continuous. (this message happens 3 times)
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = varx, y = vary, colour = fillby.
Did you mistype the name of a data column or forget to add stat()?
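The error above looks consistent with aes() receiving the names varx, vary and fillby themselves rather than the columns they refer to; symbols captured with ensym() would additionally need to be unquoted with !! inside aes(). Since the column names are passed as strings here, a possibly simpler non-deprecated option is the .data pronoun, sketched below on a made-up toy table (toy_result and its values are hypothetical, not the real output).

```r
library(data.table)
library(ggplot2)

# Toy stand-in for final_result; the values are invented for illustration.
toy_result <- data.table(geography = rep(1:4, times = 3),
                         variable  = rep(c("bravo", "charlie", "delta"), each = 4),
                         value     = seq(-1.1, 1.1, length.out = 12))

plott <- function(dataset, varx, vary, fillby) {
  # .data[[name]] looks up a column by the string stored in `name`.
  ggplot(dataset,
         aes(x = .data[[varx]], y = .data[[vary]], color = .data[[fillby]])) +
    geom_point()
}

p <- plott(dataset = toy_result,
           varx = "geography",
           vary = "value",
           fillby = "variable")
```

A data.table is also a data.frame, so nothing data.table-specific should be needed on the ggplot2 side.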