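If you only want to match whole words, you need to use word-boundary markers; otherwise prefixes (and suffixes) of longer words will match too. For example: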

    import pandas as pd

    df = pd.DataFrame({
        'opinions': [
            "I think the movie is fantastic. Shame it's so short!",
            "How did they make it?",
            "I had a fantastic time at the cinema last night!",
            "I really disliked the cast",
            "the film was sad and boring",
            "Absolutely loved the movie! Can't wait to see part 2",
            "He has greatness within"
        ]
    })

    keywords = ['movie', 'great', 'fantastic', 'loved']

    # Join the keywords into one alternation and wrap it in word boundaries
    query = '|'.join(keywords)
    df['word'] = df['opinions'].str.findall(r'\b({})\b'.format(query))
    print(df)

Output

                                                opinions                word
    0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
    1                              How did they make it?                  []
    2   I had a fantastic time at the cinema last night!         [fantastic]
    3                         I really disliked the cast                  []
    4                        the film was sad and boring                  []
    5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
    6                            He has greatness within                  []

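In the example above, greatness is not matched because of the word boundaries (\b): the keyword great is followed by another word character, so the trailing \b fails.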
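
To make the effect of \b concrete, here is a minimal sketch using only the standard `re` module, with the same keywords as above. The variable names `unbounded` and `bounded` are just for illustration. It also uses `re.escape` and a non-capturing group, which is a precaution I am adding in case a keyword contains regex metacharacters; it is not needed for this particular keyword list.

```python
import re

keywords = ['movie', 'great', 'fantastic', 'loved']

# Escape each keyword so characters such as '.' or '+' are matched literally
alternation = '|'.join(re.escape(k) for k in keywords)

# Same alternation, without and with word boundaries
unbounded = re.compile(alternation)
bounded = re.compile(r'\b(?:{})\b'.format(alternation))

text = "He has greatness within"
print(unbounded.findall(text))  # ['great']  (prefix of "greatness" matches)
print(bounded.findall(text))    # []         (no whole-word match)
```
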
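A note on performance

As a side note, if you are looking for an efficient solution for large data, a regex union is not a good approach (see here). I suggest using a library such as trrex.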

    import pandas as pd
    import trrex as tx

    df = pd.DataFrame({
        'opinions': [
            "I think the movie is fantastic. Shame it's so short!",
            "How did they make it?",
            "I had a fantastic time at the cinema last night!",
            "I really disliked the cast",
            "the film was sad and boring",
            "Absolutely loved the movie! Can't wait to see part 2",
            "He has greatness within"
        ]
    })

    keywords = ['movie', 'great', 'fantastic', 'loved']

    # Build the pattern with trrex, wrapped in word boundaries and a capture group
    query = tx.make(keywords, left=r"\b(", right=r")\b")
    df['word'] = df['opinions'].str.findall(r'{}'.format(query))
    print(df)

Output (using trrex)

                                                opinions                word
    0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
    1                              How did they make it?                  []
    2   I had a fantastic time at the cinema last night!         [fantastic]
    3                         I really disliked the cast                  []
    4                        the film was sad and boring                  []
    5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
    6                            He has greatness within                  []

For a performance comparison, see the image below:
![enter image description here](https://i.stack.imgur.com/EBPsb.png)
For a set of 25K words, trrex is 300 times faster than the regex union. The experiments in the image above can be reproduced with this gist.
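
For some intuition on why a trie-shaped pattern can beat a flat union, here is a rough sketch of the general idea in plain Python. It is not trrex's actual code and may differ from it in many details; `build_trie` and `trie_to_pattern` are hypothetical helper names, and I added the extra keyword 'love' only to show how a shared prefix gets factored out. With a flat union the engine may retry every alternative at each position, whereas the factored pattern lets it discard whole groups of words after a single character.

```python
import re

def build_trie(words):
    """Store the words in a nested-dict trie; the key '' marks end of word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_pattern(node):
    """Turn a trie node into a regex fragment that factors out shared prefixes."""
    alternatives = []
    optional = False
    for ch, child in sorted(node.items()):
        if ch == "":
            # A word ends here, so anything that follows is optional
            optional = True
            continue
        alternatives.append(re.escape(ch) + trie_to_pattern(child))
    if not alternatives:
        return ""
    body = "|".join(alternatives)
    if len(alternatives) > 1 or optional:
        body = "(?:" + body + ")"
    if optional:
        body += "?"
    return body

# 'love' is an illustrative addition to the keyword list used above
keywords = ['movie', 'great', 'fantastic', 'loved', 'love']
pattern = r"\b" + trie_to_pattern(build_trie(keywords)) + r"\b"
regex = re.compile(pattern)

print(pattern)  # \b(?:fantastic|great|love(?:d)?|movie)\b
print(regex.findall("Absolutely loved the movie! Can't wait to see part 2"))  # ['loved', 'movie']
print(regex.findall("He has greatness within"))  # []
```

For a handful of keywords the difference is negligible; the factoring only starts to pay off as the word list grows and many words share prefixes.
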
Disclaimer: I am the author of trrex.