This is an update of my previous post https://stackoverflow.com/questions/49064114/extracting-n-grams-from-tweets-in-python, with some changes:
Assume I have 100 tweets.
From these tweets I need to extract: 1) food names, and 2) beverage names. For each extraction I also need to attach its type (drink or food) and an ID number (each item has a unique ID).
I already have a lexicon with the names, types and ID numbers:
lexicon = {
'dr pepper': {'type': 'drink', 'id': 'd_123'},
'coca cola': {'type': 'drink', 'id': 'd_234'},
'cola': {'type': 'drink', 'id': 'd_345'},
'banana': {'type': 'food', 'id': 'f_456'},
'banana split': {'type': 'food', 'id': 'f_567'},
'cream': {'type': 'food', 'id': 'f_678'},
'ice cream': {'type': 'food', 'id': 'f_789'}}
Example tweets:
After various preprocessing of "tweet_1", I have the following sentences:
sentences = [
'dr pepper is better than coca cola and suits banana split with ice cream',
'coca cola and banana is not a good combo']
My requested output (it can be of another type than a list):
["tweet_id_1",
[[["dr pepper"], ["drink", "d_123"]],
[["coca cola"], ["drink", "d_234"]],
[["banana split"], ["food", "f_567"]],
[["ice cream"], ["food", "f_789"]]],
"tweet_id_1",
[[["coca cola"], ["drink", "d_234"]],
[["banana"], ["food", "f_456"]]]]
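To make the matching rule concrete, here is a minimal sketch of the longest-first extraction I have in mind (plain Python, no regex; the helper name `extract` and the variable `max_n` are just my own illustrative names):

```python
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'},
}

# length in words of the longest lexicon entry
max_n = max(len(key.split()) for key in lexicon)

def extract(sentence):
    tokens = sentence.split()
    out, i = [], 0
    while i < len(tokens):
        # try the longest n-gram starting at position i first,
        # so "dr pepper" shadows any unigram inside it
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            gram = ' '.join(tokens[i:i + n])
            if gram in lexicon:
                entry = lexicon[gram]
                out.append([[gram], [entry['type'], entry['id']]])
                i += n          # skip the matched words entirely
                break
        else:
            i += 1              # no lexicon entry starts here
    return out
```

Because a match consumes all of its words before scanning continues, "coca cola" never also yields "cola", which is exactly the behaviour I want.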
Importantly, the output should NOT extract unigrams that are part of a longer matched n-gram (n>1), i.e. this output is wrong:
["tweet_id_1",
[[["dr pepper"], ["drink", "d_123"]],
[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana split"], ["food", "f_567"]],
[["banana"], ["food", "f_456"]],
[["ice cream"], ["food", "f_789"]],
[["cream"], ["food", "f_678"]]],
"tweet_id_1",
[[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana"], ["food", "f_456"]]]]
Ideally, I would like to be able to run my sentences through various nltk filters such as lemmatize() and pos_tag() BEFORE the extraction, so as to get output like the one below. But with the regex solution, if I do that, all words are split into unigrams, or the string "coca cola" generates 1 unigram plus 1 bigram, which produces output I don't want (as in the example above).
The ideal output (again, the exact type of the output is not important):
["tweet_id_1",
[[[("dr pepper", "NN")], ["drink", "d_123"]],
[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana split", "NN")], ["food", "f_567"]],
[[("ice cream", "NN")], ["food", "f_789"]]],
"tweet_id_1",
[[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana", "NN")], ["food", "f_456"]]]]
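One workaround I have been considering (names here are my own; NLTK's `MWETokenizer` offers similar merging): before running per-token filters like pos_tag() or a lemmatizer, merge the multiword lexicon entries into single tokens so they cannot be split apart, then undo the merge when looking entries up in the lexicon afterwards.

```python
# multiword lexicon entries that must survive per-token filters intact
multiword = {'dr pepper', 'coca cola', 'banana split', 'ice cream'}

def merge_mwes(tokens):
    """Join adjacent tokens that form a known multiword entry with '_'."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and ' '.join(tokens[i:i + 2]) in multiword:
            out.append('_'.join(tokens[i:i + 2]))  # e.g. 'coca_cola'
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = merge_mwes('coca cola and banana is not a good combo'.split())
# tokens can now be fed to nltk.pos_tag(tokens); recovering the lexicon
# key afterwards is just token.replace('_', ' ')
```

This only handles two-word entries, matching my current lexicon; longer entries would need the same longest-first loop as the extraction step.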