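To compare strings for similarity, the first step is usually to clean the data as well as your knowledge of it allows:
Many string distance methods treat uppercase and lowercase letters as different characters, so convert everything to the same case first. You can also apply any other cleaning that helps improve accuracy.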
library(dplyr)
companyName <- company$CompanyName %>%
  toupper() %>% # convert to upper case
  stringr::str_replace_all("\\s+", " ") %>% # collapse consecutive whitespace into a single space
  stringr::str_remove_all("\\.|,") # remove all commas and dots
> companyName
[1] "MERCK SHARP & DOHME CORPORATION" "GILEAD SCIENCES INC" "BOEHRINGER INGELHEIM PHARMACEUTICALS INC"
[4] "ABBVIE INC" "JANSSEN SCIENTIFIC AFFAIRS LLC" "BOEHRINGER INGELHEIM PHARMA GMBH & COKG"
[7] "ASAHI INTECC CO LTD" "ASAHI INTECC USA INC"
Compute the string distances:
distanceMatrix <- stringdist::stringdistmatrix(
a = companyName,
b = companyName,
# You can pick the method that works best for your data. Also, manual inspection is needed. See ?stringdist
# I'm picking soundex for this example
method = "soundex"
)
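With the soundex method, a cell is 0 when the two strings have the same Soundex code and 1 otherwise, so a 0 means the corresponding row and column entries are very close: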
> distanceMatrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0 1 1 1 1 1 1 1
[2,] 1 0 1 1 1 1 1 1
[3,] 1 1 0 1 1 0 1 1
[4,] 1 1 1 0 1 1 1 1
[5,] 1 1 1 1 0 1 1 1
[6,] 1 1 0 1 1 0 1 1
[7,] 1 1 1 1 1 1 0 0
[8,] 1 1 1 1 1 1 0 0
This means that, in the companyName vector, item 3 is close to item 6 and item 7 is close to item 8.
result <- which(distanceMatrix==0,arr.ind = TRUE) %>%
as.data.frame() %>%
dplyr::filter(col > row)
> result
row col
1 3 6
2 7 8
> result %>% mutate_all(~companyName[.x])
row col
1 BOEHRINGER INGELHEIM PHARMACEUTICALS INC BOEHRINGER INGELHEIM PHARMA GMBH & COKG
2 ASAHI INTECC CO LTD ASAHI INTECC USA INC
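To see why a pair was matched, it can help to look at the Soundex codes themselves. This is a small sketch, assuming the stringdist package is installed; pairNames is a hypothetical helper vector holding the two names from row 2 above. A Soundex code keeps only the first letter plus three digits, so only the beginning of each name influences the match, which is why long names that start the same way end up at distance 0.

```r
library(stringdist)

# Hypothetical helper vector: the two names matched in row 2 of the result
pairNames <- c("ASAHI INTECC CO LTD", "ASAHI INTECC USA INC")

# phonetic() returns the Soundex code that method = "soundex" compares
codes <- phonetic(pairNames, method = "soundex")
codes

# TRUE when both names share a code, i.e. their soundex distance is 0
sameCode <- codes[1] == codes[2]
sameCode
```

If two names that should not match share a code, that is a hint to try a finer-grained method.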
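Note that you can improve accuracy by cleaning the strings further, or by choosing a different method, parameters, or threshold when computing the string distance. It can never guarantee 100% accuracy, though.
Finally, to count the unique companies, we can do: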
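As one illustration of "a different method and threshold", here is a hedged sketch using the Jaro-Winkler distance (method = "jw") instead of soundex. The threshold of 0.15 and the name exampleNames are my own assumptions for illustration, not tuned values; you would need to inspect the resulting pairs by hand and adjust.

```r
library(stringdist)

# Hypothetical subset of the cleaned names from above
exampleNames <- c(
  "BOEHRINGER INGELHEIM PHARMACEUTICALS INC",
  "BOEHRINGER INGELHEIM PHARMA GMBH & COKG",
  "GILEAD SCIENCES INC"
)

# Jaro-Winkler distance: 0 = identical, 1 = nothing in common
jwMatrix <- stringdistmatrix(exampleNames, exampleNames, method = "jw", p = 0.1)

# Assumed cutoff; pairs below it are treated as "the same company"
threshold <- 0.15

# upper.tri() keeps each pair once and drops the diagonal
closePairs <- which(jwMatrix < threshold & upper.tri(jwMatrix), arr.ind = TRUE)
closePairs
```

Unlike soundex, this gives a continuous score, so the cutoff becomes the main thing to tune.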
> length(companyName) - length(unique(result$row))
[1] 6
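If you switch to a thresholded distance, matches are no longer guaranteed to be transitive, and counting by subtraction gets fragile. One alternative, sketched here under my own assumptions rather than as the definitive approach, is to group the names with single-linkage clustering and count the groups. hclust() and cutree() come from base R's stats package; the cut height h = 0.5 is my choice because the soundex distance is only ever 0 or 1.

```r
library(stringdist)

# Cleaned names as printed above
companyName <- c(
  "MERCK SHARP & DOHME CORPORATION", "GILEAD SCIENCES INC",
  "BOEHRINGER INGELHEIM PHARMACEUTICALS INC", "ABBVIE INC",
  "JANSSEN SCIENTIFIC AFFAIRS LLC", "BOEHRINGER INGELHEIM PHARMA GMBH & COKG",
  "ASAHI INTECC CO LTD", "ASAHI INTECC USA INC"
)

# With only `a` supplied, stringdistmatrix() returns a "dist" object
d <- stringdistmatrix(companyName, method = "soundex")

# Single linkage joins any names connected by a chain of close pairs
hc <- hclust(d, method = "single")

# Cut between 0 and 1 so only distance-0 pairs share a group
groups <- cutree(hc, h = 0.5)

nUnique <- length(unique(groups))
nUnique
```

The groups vector also tells you which names were merged, which is handy for manual inspection.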