Chunking sentences at the word "but" using a regular expression

2024-01-21

I am trying to chunk sentences at the word "but" (or any other coordinating conjunction) using a regular expression. It isn't working...

import nltk
from nltk import word_tokenize

sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = nltk.RegexpParser(grammar).parse(sentence)  # `grammar` is defined elsewhere (not shown)
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees():
    if subtree.label() == 'CHUNK': print(subtree.leaves())

I need to split the sentence "There are no large collections present but there is spinal canal stenosis." into two parts:

1. "There are no large collections present"
2. "there is spinal canal stenosis."

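I would also like to use the same code to split sentences at "and" and other coordinating conjunction (CC) words. But my code isn't working. Please help.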


I think you can simply do

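import re
# Here `sentence` must be the raw string, not the POS-tagged list from the question
sentence = "There are no large collections present but there is spinal canal stenosis."
result = re.split(r"\s+(?:but|and)\s+", sentence)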

where

`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)
`(?:`       Match the regular expression below, do not capture
            Match either the regular expression below (attempting the next alternative only if this one fails)
  `but`     Match the characters "but" literally
  `|`       Or match regular expression number 2 below (the entire group fails if this one fails to match)
  `and`     Match the characters "and" literally
`)`         End of the non-capturing group
`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)
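
As a quick check of the pattern broken down above, here is a minimal sketch that applies it to the sentence from the question. The variable names `text` and `parts` are my own; the pattern is the one from the answer.

```python
import re

text = "There are no large collections present but there is spinal canal stenosis."

# Split wherever "but" or "and" appears with whitespace on both sides.
# The surrounding whitespace is consumed, so the pieces come back trimmed.
parts = re.split(r"\s+(?:but|and)\s+", text)

print(parts)
# ['There are no large collections present', 'there is spinal canal stenosis.']
```

Because the pattern requires whitespace on both sides, a conjunction embedded in a longer word is not treated as a split point. Note that the match is case-sensitive, so a capitalized "But" would not split.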

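You can add more conjunctions there, separated by the pipe character `|`. Make sure, though, that these words don't contain characters that have a special meaning in a regex. If in doubt, escape them first with `re.escape(word)`.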
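
One way to follow that advice is to build the alternation from a list of words, escaping each one. This is only a sketch: `build_splitter` is a hypothetical helper name, the word list is an arbitrary choice, and the second example sentence is made up for illustration.

```python
import re

def build_splitter(conjunctions):
    """Compile a pattern that splits on any of the given words.

    Hypothetical helper, not part of the original answer.
    """
    # re.escape makes any regex metacharacters in a word match literally
    alternatives = "|".join(re.escape(word) for word in conjunctions)
    return re.compile(r"\s+(?:" + alternatives + r")\s+")

# The word list is an assumption; extend it as needed
splitter = build_splitter(["but", "and", "or", "yet"])

first = splitter.split(
    "There are no large collections present but there is spinal canal stenosis."
)
# Made-up sentence containing two different conjunctions
second = splitter.split("The scan is clear or nearly so, yet the patient reports pain.")

print(first)
print(second)
```

A fixed word list only covers the conjunctions you enumerate, so it is a simpler substitute for splitting on the tagger's CC label rather than an equivalent of it.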
