How to perform approximate (fuzzy) name matching in R


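I have a large dataset covering biology journals that was compiled by different people over a long period. As a result, the data do not follow a single format. For example, in the "AUTHOR" column I can find John Smith, Smith John, Smith J and so on, all referring to the same person. I can't perform even the simplest operations. For example, I can't work out which authors wrote the most articles.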


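There are packages that can help with this, and some are listed in the comments. If you'd rather not use them, I thought I'd try writing something in R that might help. The code matches "John Smith" against "J Smith", "John Smith", "Smith John" and "John S". It does not match something like "John Sally".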

# generate some random names
names = c(
  "John Smith", 
  "Wigberht Ernust",
  "Samir Henning",
  "Everette Arron",
  "Erik Conor",
  "Smith J",
  "Smith John",
  "John S",
  "John Sally"
);

# split those names and get all ways to write that name
split_names = lapply(
  X = names,
  FUN = function(x){
    # split by a space
    c_split = unlist(x = strsplit(x = x, split = " "));
    # get both combinations of c_split to compensate for order
    c_splits = list(c_split, rev(x = c_split));
    # return c_splits
    return(c_splits);
  }
);

# suppose we're looking for John Smith
search_for = "John Smith";

# split it by " " and then find all ways to write that name
search_for_split = unlist(x = strsplit(x = search_for, split = " "));
search_for_split = list(search_for_split, rev(x = search_for_split));

# initialise a vector containing if search_for was matched in names
match_statuses = c();

# for each name that's been split
for(i in 1:length(x = names)){

  # the match status for the current name
  match_status = FALSE;

  # the current split name
  c_split_name = split_names[[i]];

  # for each element in search_for_split
  for(j in 1:length(x = search_for_split)){

    # the current combination of name
    c_search_for_split_names = search_for_split[[j]];

    # for each element in c_split_name
    for(k in 1:length(x = c_split_name)){

      # the current combination of current split name
      c_c_split_name = c_split_name[[k]];

      # if there's a match, i.e. the length of the result of grep (a pattern
      # finding function) is greater than zero
      if(
        # is c_search_for_split_names first element in c_c_split_name first
        # element
        length(
          x = grep(
            pattern = c_search_for_split_names[1],
            x = c_c_split_name[1]
          )
        ) > 0 &&
        # is c_search_for_split_names second element in c_c_split_name second
        # element
        length(
          x = grep(
            pattern = c_search_for_split_names[2],
            x = c_c_split_name[2]
          )
        ) > 0 ||
        # or, is c_c_split_name first element in c_search_for_split_names first
        # element
        length(
          x = grep(
            pattern = c_c_split_name[1],
            x = c_search_for_split_names[1]
          )
        ) > 0 &&
        # is c_c_split_name second element in c_search_for_split_names second
        # element
        length(
          x = grep(
            pattern = c_c_split_name[2],
            x = c_search_for_split_names[2]
          )
        ) > 0
      ){
        # if this is the case, update match status to TRUE
        match_status = TRUE;
      } else {
        # otherwise, don't update match status
      }
    }
  }

  # append match_status to the match_statuses vector
  match_statuses = c(match_statuses, match_status);
}

# print the name that was searched for
search_for;

[1] "John Smith"

cbind(names, match_statuses);

     names             match_statuses
[1,] "John Smith"      "TRUE"        
[2,] "Wigberht Ernust" "FALSE"       
[3,] "Samir Henning"   "FALSE"       
[4,] "Everette Arron"  "FALSE"       
[5,] "Erik Conor"      "FALSE"       
[6,] "Smith J"         "TRUE"        
[7,] "Smith John"      "TRUE"        
[8,] "John S"          "TRUE"
[9,] "John Sally"      "FALSE"   
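
Once a logical vector like match_statuses exists, the matched spellings can be pulled out and counted, which is the kind of tally the question asks about. A minimal sketch: the two vectors below are simply copied from the output above so that the snippet runs on its own, and matched_variants and n_matched are illustrative names.

```r
# Values copied from the output table above so the snippet is self-contained
names <- c("John Smith", "Wigberht Ernust", "Samir Henning", "Everette Arron",
           "Erik Conor", "Smith J", "Smith John", "John S", "John Sally")
match_statuses <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)

# every spelling that was matched to the searched name
matched_variants <- names[match_statuses]

# TRUE counts as 1, so the sum is the number of matched entries
n_matched <- sum(match_statuses)

print(matched_variants)
print(n_matched)
```

If each row of the real dataset is one article, n_matched would be the number of articles attributed to that author under this matching rule.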



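  • for loops in R can be slow. If you are working with a lot of names, look into Rcpp.

  • You may want to wrap this in a function. You can then apply it to different names by changing search_for.

  • This example has time-complexity issues: every search scans every name through nested loops, so depending on the size of your data you may want or need to redesign it.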
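
The suggestion to wrap the matching in a function could look roughly like the sketch below. It is not the code above repackaged: is_name_variant is a hypothetical helper name, it assumes every name consists of exactly two space-separated parts, and it swaps grep for base R's startsWith, so only prefixes (such as an initial) count as a match rather than any substring.

```r
# Hypothetical helper: for each candidate, is it a spelling variant of target?
# Assumption: names are exactly two tokens separated by a single space.
is_name_variant <- function(target, candidates) {
  target_parts <- strsplit(target, " ", fixed = TRUE)[[1]]

  # TRUE where one string is a prefix of the other, so "J" matches "John"
  prefix_either <- function(a, b) startsWith(a, b) | startsWith(b, a)

  vapply(candidates, function(candidate) {
    cand_parts <- strsplit(candidate, " ", fixed = TRUE)[[1]]
    # anything that is not a two-token name is treated as a non-match
    if (length(target_parts) != 2L || length(cand_parts) != 2L) {
      return(FALSE)
    }
    # compare in the given order, then with the candidate's tokens swapped
    same_order <- all(prefix_either(target_parts, cand_parts))
    swapped <- all(prefix_either(target_parts, rev(cand_parts)))
    same_order || swapped
  }, logical(1), USE.NAMES = FALSE)
}

authors <- c("John Smith", "Wigberht Ernust", "Samir Henning",
             "Everette Arron", "Erik Conor", "Smith J", "Smith John",
             "John S", "John Sally")

print(authors[is_name_variant("John Smith", authors)])
```

Calling the helper with a different first argument searches for a different author. It still visits every candidate once per search, so it tidies the code but does not remove the scaling concern noted above.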


