I have some data containing text, and I would like to try to extract company names from that text. The data looks like this:
d <- data.frame(
textColumn = c(
"Apple CEO reports positive growth in Iphone sales",
"Apple's quarterly results are expected to beat that of Intel's",
"Microsoft is expected to release a new product which rivales Apple's Iphone which uses Intels processors",
"Intel Corporation seeks to hire 5000 new staff",
"Amazon enters a new market, the same as Intel"
)
)
Data:
textColumn
1 Apple CEO reports positive growth in Iphone sales
2 Apple's quarterly results are expected to beat that of Intel's
3 Microsoft is expected to release a new product which rivales Apple's Iphone
4 Intel Corporation seeks to hire 5000 new staff
5 Amazon enters a new market, the same as Intel
I also have a number of company names in a vector.
companyNames <- c(
"Apple Inc",
"Intel Corp",
"Microsoft Corporation",
"Amazon Company"
)
Data:
[1] "Apple Inc" "Intel Corp" "Microsoft Corporation" "Amazon Company"
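The text data does not let me match the company names exactly, because the strings in my vector mostly contain the full company names (Apple Inc, Intel Corp, etc.), whereas the text only refers to the companies as Apple, Intel, etc.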
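To illustrate the mismatch, here is a minimal check in base R. The variable names are only for illustration, and it assumes that the first word of a full name is the short name used in the text:

```r
# First row of the sample text and one full company name
text1 <- "Apple CEO reports positive growth in Iphone sales"
fullName <- "Apple Inc"

# Exact search for the full name fails because "Inc" never appears in the text
fullMatch <- grepl(fullName, text1, fixed = TRUE)

# Assumption: the first word of the full name is the short name used in the text
shortName <- sub(" .*$", "", fullName)
shortMatch <- grepl(shortName, text1, fixed = TRUE)

print(c(full = fullMatch, short = shortMatch))
```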
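I would like to use fuzzy string extraction to try to pull the company names out of the text, so the expected output for this example would be: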
c(
"Apple",
"Apple | Intel",
"Microsoft | Apple | Intel",
"Intel",
"Amazon | Intel"
)
Data:
[1] "Apple" "Apple | Intel" "Microsoft | Apple | Intel" "Intel" "Amazon | Intel"
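This is because Apple appears only once, in the first row of the text data, while Apple and Intel both appear in the second row (so I separate them with |). I have been looking into FuzzExtract from the fuzzywuzzyR package (https://rdrr.io/cran/fuzzywuzzyR/man/FuzzExtract.html), but I cannot seem to get it to work on my sample data.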
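For reference, a minimal sketch of one way to produce the expected output with base R only. It does not use fuzzywuzzyR and is not true fuzzy matching: it assumes the first word of each full name is the short name used in the text, and extractCompanies is a hypothetical helper written for this example.

```r
# Sample data from the question
d <- data.frame(
  textColumn = c(
    "Apple CEO reports positive growth in Iphone sales",
    "Apple's quarterly results are expected to beat that of Intel's",
    "Microsoft is expected to release a new product which rivales Apple's Iphone which uses Intels processors",
    "Intel Corporation seeks to hire 5000 new staff",
    "Amazon enters a new market, the same as Intel"
  ),
  stringsAsFactors = FALSE
)
companyNames <- c("Apple Inc", "Intel Corp", "Microsoft Corporation", "Amazon Company")

# Assumption: the first word of each full name is the short name used in the text
shortNames <- sub(" .*$", "", companyNames)

# Hypothetical helper: return the candidates found in `text`,
# ordered by where they first appear and joined with " | "
extractCompanies <- function(text, candidates) {
  # Position of the first occurrence of each candidate, -1 if absent
  pos <- vapply(
    candidates,
    function(n) regexpr(n, text, fixed = TRUE)[[1]],
    integer(1)
  )
  found <- candidates[pos > 0]
  found <- found[order(pos[pos > 0])]
  paste(found, collapse = " | ")
}

result <- vapply(
  d$textColumn, extractCompanies, character(1),
  candidates = shortNames, USE.NAMES = FALSE
)
print(result)
```

If the text may contain spelling variations of the names, base R's agrepl performs approximate matching and could replace the exact search, at the cost of tuning its max.distance argument.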