If I have a set of sentences and want to extract the duplicates, I can do it as in the following example:
sentences<-c("So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brought this upon you, my",
"So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brought this upon you, my")
sentences[duplicated(sentences)]
which returns:
[1] "So there I was at the mercy of three monstrous trolls"
[2] "Today is my One Hundred and Eleventh birthday"
[3] "I'm sorry I brought this upon you, my"
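In my case, however, some sentences are only similar to each other rather than identical (for example, because of typos), and I want to pick out the sentences that closely resemble one another. For example: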
sentences<-c("So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brrrought this upon, my",
"So there I was at mercy of three monstrous troll",
"Today is One Hundred Eleventh birthday",
"I'm sorry I brought this upon you, my")
Based on this example, I want to select one sentence from each of the following pairs:
I'm sorry I brought this upon you, my
I'm sorry I brrrought this upon, my
Today is One Hundred Eleventh birthday
Today is my One Hundred and Eleventh birthday
So there I was at the mercy of three monstrous trolls
So there I was at mercy of three monstrous troll
The levenshteinSim function in the RecordLinkage package can help me:
library(RecordLinkage)
levenshteinSim(sentences[1],sentences[2])
levenshteinSim(sentences[1],sentences[3])
levenshteinSim(sentences[1],sentences[4])
levenshteinSim(sentences[1],sentences[5])
levenshteinSim(sentences[1],sentences[6])
levenshteinSim(sentences[2],sentences[3])
levenshteinSim(sentences[2],sentences[4])
levenshteinSim(sentences[2],sentences[5])
levenshteinSim(sentences[2],sentences[6])
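and so on, returning values close to 1 for the most similar sentences. I could write a double for loop and select, for example, the sentence pairs whose Levenshtein similarity is greater than 0.7. But isn't there a simpler way?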
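To make the idea concrete, here is a minimal sketch of what a loop-free version might look like. It does not use RecordLinkage: it assumes that base R's adist() (Levenshtein distance, from the utils package) is an acceptable substitute, and it rescales the distance by the length of the longer string so that the result behaves like a similarity between 0 and 1. The names d, len, sim, pairs and similar_pairs are made up for illustration, and the 0.7 cutoff is just the example threshold mentioned above.

```r
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# All pairwise Levenshtein distances in one call (base R, no loop).
d <- adist(sentences)

# Assumption: turn distance into a similarity in [0, 1] by dividing
# by the length of the longer string of each pair.
len <- nchar(sentences)
sim <- 1 - d / outer(len, len, pmax)

# Keep each unordered pair once (upper triangle) above the example threshold.
pairs <- which(sim > 0.7 & upper.tri(sim), arr.ind = TRUE)

similar_pairs <- data.frame(
  first      = sentences[pairs[, "row"]],
  second     = sentences[pairs[, "col"]],
  similarity = sim[pairs]
)
print(similar_pairs)
```

From similar_pairs, one sentence per pair can then be kept, for instance the first column. Whether 0.7 is a sensible cutoff depends on the data, so it is worth checking the sim matrix before settling on a value.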