我觉得OP的方法已经很好了。然而, ...
对于OP的两个步骤,有......
bycols = c("NAME", "DOB")
dcol = "ID"
cols = c(bycols, dcol)
w1 = customers[customers[, .N, by=bycols][N > 1L, !"N"], on=bycols, which=TRUE]
customers[w1][!customers[w1, .N, by=cols][N > 1L, !"N"], on=cols]
对于引用的任务...
mDT = customers[!duplicated(customers, by=cols), .N, by=bycols][N > 1L]
customers[mDT[, !"N"], on=bycols]
不管怎样,对于OP的例子,我们得到
NAME DOB ID PIN
1: MARIAM SUDHAKAR 24-01-1954 502475 500032
2: MARIAM SUDHAKAR 24-01-1954 502047 500032
3: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
4: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
5: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
6: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
7: B ANNAPURNA 12-12-1962 502878 500084
8: B ANNAPURNA 12-12-1962 502877 500084
9: DARBAR ASHOK 03-01-1950 502639 500032
10: DARBAR ASHOK 03-01-1950 502548 500084
11: KOTLA CHENNAMMA 15-07-1958 502466 500032
12: KOTLA CHENNAMMA 15-07-1958 502467 500032
mDT
是一个描述重复项的汇总表,方便浏览:
> mDT
NAME DOB N
1: MARIAM SUDHAKAR 24-01-1954 2
2: VELPURI LAKSHMI SUJATHA 28-06-1971 2
3: VENKATESHWARAN PONNAMBALAM 14-04-1969 2
4: B ANNAPURNA 12-12-1962 2
5: DARBAR ASHOK 03-01-1950 2
6: KOTLA CHENNAMMA 15-07-1958 2
__San__ 编辑(原海报):
修改数据集以显示如何处理 Frank 之前指出的情况:“ID = A A B 的组将失去两个 A 并保留 B”
library(data.table)
# load the data
customers <- structure(list(
NAME = c("GEETA SHYAM RAO", "B V RAMANA", "GONTU VENKATARAMANAIAH",
"DAMAT RAMAKRISHNA", "MARIAM SUDHAKAR", "VELPURI LAKSHMI SUJATHA",
"MOHAMMED LIYAKHAT ALI", "VENKATESHWARAN PONNAMBALAM",
"DEVARAKONDA SATISH BABU", "GEEDI RAMULU", "KANDU OBULESU",
"B V RAMANA", "DOKKA RAJESH", "G TULASIRAM REDDY",
"MALLELA CHIRANJEEVI", "MANEPALLI VENKATA RAVAMMA",
"DOKKA JAGADEESHWAR", "K KRISHNA", "B SUDARSHAN", "B ANNAPURNA",
"CHAVVA SHIVA RAMULU", "BIKASH BAHADUR CHITRE", "DARBAR ASHOK",
"VEMULAPALLY SANGAMESHWAR RAO", "MOHAMMED ABDUL HAKEEM ANWAR",
"MANEPALLI SHIV SHANKAR RAO", "MOHD MISKEEN MOHIUDDIN",
"KOTLA CHENNAMMA", "NAYAK SURYAKANTH", "GOPIREDDY INDIRA",
"MEKALA SREEDEVI", "K KRISHNA", "B V RAMANA",
"KUMMARI VENKATESHAM", "BHAVANI CONSRUCTIONS",
"UPPUTHOLLA KOTAIAH", "YEDIDHA NIRMALA DEVI", "MARIAM SUDHAKAR",
"B ANNAPURNA", "VELPURI LAKSHMI SUJATHA", "DARBAR ASHOK",
"AMMANA VISHNU VARDHAN REDDY", "ZAITOON BEE", "MOHD CHAND PASHA",
"PALERELLA RAMESH", "GEEDI SRINIVAS", "RAMAIAH SADU",
"BIMAN BALAIAH", "KOTLA CHENNAMMA", "VENKATESHWARAN PONNAMBALAM"),
DOB = c("13-02-1971", "15-01-1960", "01-07-1970", "10-03-1977",
"24-01-1954", "28-06-1971", "26-01-1980", "14-04-1969",
"23-09-1978", "15-08-1954", "09-10-1984", "15-01-1960",
"29-09-1984", "03-03-1975", "26-01-1979", "01-01-1964",
"21-01-1954", "01-05-1964", "12-03-1975", "12-12-1962",
"10-03-1982", "14-05-1983", "03-01-1950", "04-03-1962",
"12-05-1966", "01-06-1960", "10-03-1964", "15-07-1958",
"26-06-1979", "02-04-1974", "10-01-1975", "01-05-1964",
"15-01-1960", "08-08-1977", NA, "05-04-1981", "29-08-1971",
"24-01-1954", "12-12-1962", "28-06-1971", "03-01-1950",
"23-06-1970", "20-02-1960", "05-07-1975", "10-01-1979",
"31-08-1982", "10-08-1983", "10-03-1964", "15-07-1958",
"14-04-1969"),
ID = c(502969, 502902, 502985, 502981, 502475, 502267, 502976,
502272, 502977, 502973, 502986, 502910, 502989, 502998, 502967,
502971, 502988, 502737, 502995, 502878, 502972, 502984, 502639,
502968, 502975, 502970, 502997, 502466, 502991, 502982, 502980,
502737, 502902, 502999, 502994, 502987, 502990, 502047, 502877,
502251, 502548, 502992, 503000, 502993, 502983, 502974, 502996,
502979, 502467, 502290),
PIN = c(500082, 500032, 500032, 500032, 500032, 500084, 500032, 500032,
500032, 500032, 500032, 500033, 500032, 500084, 500084, 500032,
5e+05, 500050, 500032, 500084, 500032, 500032, 500032, 500050,
500032, 500032, 500045, 500032, 500084, 500032, 500032, 500084,
500035, 500084, 500032, 500032, 500032, 500032, 500084, 500032,
500084, 500033, 500084, 500032, 500032, 500032, 500084, 500032,
500032, 500032)),
.Names = c("NAME", "DOB", "ID", "PIN"),
row.names = c(NA, -50L), class = c("data.table", "data.frame"))
# define function for duplicate key exclusion
dupKeyEx <- function(DT, dup_cols, unique_cols) {
cols <- c(dup_cols, unique_cols)
mDT <- DT[!duplicated(DT, by=cols), .N, by=dup_cols][N > 1L]
ans <- unique(DT[mDT[, !"N"], on=dup_cols], by=cols)
return(ans)
}
# call function
result <- dupKeyEx(customers, c("NAME", "DOB"), "ID")
result
结果告诉我们,B V RAMANA(相同的姓名和出生日期)已获得多个 ID,并显示这些不同的 ID,如下所示:
NAME DOB ID PIN
1: B ANNAPURNA 12-12-1962 502877 500084
2: B ANNAPURNA 12-12-1962 502878 500084
3: B V RAMANA 15-01-1960 502902 500032
4: B V RAMANA 15-01-1960 502910 500033
5: DARBAR ASHOK 03-01-1950 502548 500084
6: DARBAR ASHOK 03-01-1950 502639 500032
7: KOTLA CHENNAMMA 15-07-1958 502466 500032
8: KOTLA CHENNAMMA 15-07-1958 502467 500032
9: MARIAM SUDHAKAR 24-01-1954 502047 500032
10: MARIAM SUDHAKAR 24-01-1954 502475 500032
11: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
12: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
13: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
14: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
如果 OP 方法用于此修改后的数据集,则 B V RAMANA 具有相同 ID 的两行将丢失,并且结果将仅显示颁发给 B V RAMANA 的一个(总共三个)ID,因为使用该方法“ID = A A B 的一组”将失去两个 A 并保留 B”(引用 Frank 的话)。这同样适用于 Uwe Block 的其他 data.table 解决方案。不符合目的的结果如下所示。
NAME DOB ID PIN
1: B ANNAPURNA 12-12-1962 502878 500084
2: B ANNAPURNA 12-12-1962 502877 500084
3: B V RAMANA 15-01-1960 502910 500033
4: DARBAR ASHOK 03-01-1950 502639 500032
5: DARBAR ASHOK 03-01-1950 502548 500084
6: KOTLA CHENNAMMA 15-07-1958 502466 500032
7: KOTLA CHENNAMMA 15-07-1958 502467 500032
8: MARIAM SUDHAKAR 24-01-1954 502475 500032
9: MARIAM SUDHAKAR 24-01-1954 502047 500032
10: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
11: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
12: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
13: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
直到有人想到新方法失败的情况,我认为我们已经找到了“重复键排除”的正确解决方案。 Frank 提供的出色的 catch 和 data.table 解决方案。
对于涵盖由于前导或尾随空格而导致值不匹配的情况的扩展,请参阅OP的后续内容如何引用函数内变量中保存的多个列名,它使用stringr::str_trim()
并得出结论:
dupKeyEx <- function(DT, dup_cols, unique_cols) {
cols <- c(dup_cols, unique_cols)
chr_cols <- cols[sapply(DT[, ..cols], is.character)]
DT[, (chr_cols) := lapply(.SD, stringr::str_trim), .SDcols=chr_cols]
mDT <- DT[!duplicated(DT, by=cols), .N, by=dup_cols][N > 1L]
ans <- unique(DT[mDT[, !"N"], on=dup_cols], by=cols)
setorderv(ans, c(dup_cols, unique_cols))
return(ans)
}