The lemmatizer needs the correct POS tag to be accurate. If you use the default settings of WordNetLemmatizer.lemmatize(), every word is treated as a noun,
see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39
To resolve the problem, always POS-tag your data before lemmatizing, e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
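>>> for word, tag in pos_tag(word_tokenize(sent)):
...     # First letter of the Penn Treebank tag, lowercased ('VBZ' -> 'v').
...     wntag = tag[0].lower()
...     # Keep it only if it is a WordNet POS letter; otherwise skip lemmatizing.
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...         lemma = word
...     else:
...         lemma = wnl.lemmatize(word, wntag)
...     print(lemma)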
...
This
be
a
foo
bar
sentence
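One thing to watch in the loop above: tag[0].lower() turns the Penn Treebank adjective tags (JJ, JJR, JJS) into 'j', which is not one of 'a', 'r', 'n', 'v', so adjectives are passed through unlemmatized. Here is a minimal sketch of an explicit mapping. The helper name penn_to_wordnet is made up for illustration and is not part of NLTK; it only assumes the tagger emits Penn Treebank tags.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None.

    Hypothetical helper, not an NLTK function.
    """
    if tag.startswith('J'):    # JJ, JJR, JJS -> adjective
        return 'a'
    if tag.startswith('V'):    # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return 'v'
    if tag.startswith('N'):    # NN, NNS, NNP, NNPS -> noun
        return 'n'
    if tag.startswith('RB'):   # RB, RBR, RBS -> adverb (RP is a particle)
        return 'r'
    return None                # everything else: leave the word as-is
```

In the loop, wntag = penn_to_wordnet(tag) could stand in for the two wntag lines; the rest stays the same.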
Note that 'is' -> 'be' only happens when the verb tag is passed, i.e.
>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
'be'
To answer the question with words from your examples:
>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     # Lemmatize only when there is a usable WordNet POS letter.
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print(lemma)
...
These
sentence
involve
some
horse
around
Note that there are some quirks with WordNetLemmatizer:
- wordnet lemmatization and pos tagging in python https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
- Python NLTK Lemmatization of the word 'further' with wordnet https://stackoverflow.com/questions/22999273/python-nltk-lemmatization-of-the-word-further-with-wordnet
Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy:
- Python NLTK pos_tag not returning the correct part-of-speech tag https://stackoverflow.com/questions/30821188/python-nltk-pos-tag-not-returning-the-correct-part-of-speech-tag
- https://github.com/nltk/nltk/issues/1110
- https://github.com/nltk/nltk/pull/1143
For an out-of-the-box / off-the-shelf lemmatizer, you can take a look at https://github.com/alvations/pywsd and at how I've added some try-excepts to catch words that are not in WordNet, see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66
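To give a rough idea of the try-except approach, here is a small sketch. It is not the pywsd code; the name safe_lemmatize and its signature are assumptions for illustration. The lemmatizer is passed in as a plain callable taking (word, pos), which for NLTK could be WordNetLemmatizer().lemmatize, and the exception types caught here are a guess at what a failed lookup might raise.

```python
def safe_lemmatize(word, pos, lemmatize):
    """Return lemmatize(word, pos), or the word unchanged on failure.

    Hypothetical wrapper: `lemmatize` is any callable taking (word, pos).
    """
    if not pos:
        # No usable POS tag: keep the surface form.
        return word
    try:
        return lemmatize(word, pos)
    except (KeyError, ValueError):
        # Assumed failure modes for an unknown word or POS value;
        # fall back to the surface form instead of crashing.
        return word
```

With NLTK this might be called as safe_lemmatize(word, wntag, wnl.lemmatize) inside the loops shown earlier.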