Accessing parsed elements using Pyparsing

2023-12-25

I have a bunch of sentences that I need to parse and convert into corresponding regex search code. Examples of my sentences -

LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we

- This means that somewhere in the line, phrase one comes before phrase2 and phrase3. Also, the line must start with Therefore we

LINE_CONTAINS abc {upto 4 words} xyz {upto 3 words} pqr

- This means that I need to allow up to 4 words between the first 2 phrases and up to 3 words between the last 2 phrases

With help from Paul McGuire (here: https://stackoverflow.com/q/42415837/4169943), I wrote the following grammar -

from pyparsing import (CaselessKeyword, Word, alphanums, nums, MatchFirst, quotedString, 
    infixNotation, Combine, opAssoc, Suppress, pyparsing_common, Group, OneOrMore, ZeroOrMore)

LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(CaselessKeyword,
    """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split())

NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())

lpar=Suppress('{') 
rpar=Suppress('}')

keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH, NOT, AND, OR, 
                      BEFORE, AFTER, JOIN]) # declaring all keywords and assigning order for all further use

phrase_word = ~keyword + (Word(alphanums + '_'))

upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords') + 'words' + rpar)

phrase_term = Group(OneOrMore(phrase_word) + ZeroOrMore((upto_N_words) + OneOrMore(phrase_word)))



phrase_expr = infixNotation(phrase_term,
                            [
                             ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
                             (NOT, 1, opAssoc.RIGHT,),
                             (AND, 2, opAssoc.LEFT,),
                             (OR, 2, opAssoc.LEFT),
                            ],
                            lpar=Suppress('{'), rpar=Suppress('}')
                            ) # structure of a single phrase with its operators

line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") + 
                  Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = infixNotation(line_term,
                                   [(NOT, 1, opAssoc.RIGHT,),
                                    (AND, 2, opAssoc.LEFT,),
                                    (OR, 2, opAssoc.LEFT),
                                    ]
                                   ) # grammar for the entire rule/sentence

sample1 = """
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
"""
sample2 = """
LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else
"""

My question now is - how do I access the parsed elements so that I can convert the sentences into my regex code? For this, I tried the following -

import pprint

parsed = line_contents_expr.parseString(sample1)  # run again with sample2
print(parsed[0].asDict())
print(parsed)
pprint.pprint(parsed)

The output of the above code for sample1 was -

{}

[[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]

([([(['LINE_CONTAINS', ([([(['phrase', 'one'], {}), 'BEFORE', ([(['phrase2'], {}), 'AND', (['phrase3'], {})], {})], {})], {})], {'phrase': [(([([(['phrase', 'one'], {}), 'BEFORE', ([(['phrase2'], {}), 'AND', (['phrase3'], {})], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]}), 'AND', (['LINE_STARTSWITH', ([(['Therefore', 'we'], {})], {})], {'phrase': [(([(['Therefore', 'we'], {})], {}), 1)], 'line_directive': [('LINE_STARTSWITH', 0)]})], {})], {})

The output of the above code for sample2 was -

{'phrase': [[['abcd', {'numberofwords': 4}, 'xyzw', {'numberofwords': 3}, 'pqrs'], 'BEFORE', ['something', 'else']]], 'line_directive': 'LINE_CONTAINS'}

[['LINE_CONTAINS', [[['abcd', ['upto', 4, 'words'], 'xyzw', ['upto', 3, 'words'], 'pqrs'], 'BEFORE', ['something', 'else']]]]]

([(['LINE_CONTAINS', ([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {})], {'phrase': [(([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]})], {})

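Based on the above output, my questions are -

  1. Why does pprint (pretty print) give a more detailed parse than a normal print?
  2. Why does the asDict() method give no output for sample1 but does give output for sample2?
  3. Whenever I try to access the parsed elements using print(parsed.numberofwords) or parsed.line_directive or parsed.line_term, it gives me nothing. How can I access these elements so that I can use them to build my regex code?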

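To answer your printing questions: 1) pprint is there to pretty-print a nested list of tokens, without showing any results names - it is essentially a wrapper around calling pprint.pprint(results.asList()). 2) asDict() is there to convert your parsed results to an actual Python dict, so it only shows the results names (with nesting if you have names within names).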

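To view the contents of your parsed output, you are best off using print(result.dump()). dump() shows both the nesting of the results and any named results along the way.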

result = line_contents_expr.parseString(sample2)
print(result.dump())
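The difference between these views, and the reason a name looked up at the wrong nesting level seems to return nothing, can be seen on a much smaller grammar. The following is only a sketch: the `entry` grammar and the `key` results name are made up for illustration, and only `numberofwords` is borrowed from the question. Results names defined inside a Group stay inside that group, so they have to be read from the group itself rather than from the top-level result.

```python
from pyparsing import Group, OneOrMore, Word, alphas, pyparsing_common

# Hypothetical mini-grammar: each entry is a word followed by an integer.
# The results names live inside the Group, not at the top level.
entry = Group(Word(alphas)("key") + pyparsing_common.integer("numberofwords"))
grammar = OneOrMore(entry)

result = grammar.parseString("abc 4 xyz 3")

print(result.asList())               # [['abc', 4], ['xyz', 3]]
print(result.asDict())               # {} - no names at the top level
print(repr(result.numberofwords))    # '' - a missing name yields an empty string
print(result[0].numberofwords)       # 4 - the name is found one level down
print(result[1].asDict())            # {'key': 'xyz', 'numberofwords': 3}
print(result.dump())                 # nesting and names together
```

This is consistent with the asDict() results in the question: for sample1, parsed[0] is the group built around the top-level AND, which carries no names of its own, whereas for sample2, parsed[0] is the line term that holds line_directive and phrase.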

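I also recommend using expr.runTests, which gives you the dump() output as well as any exceptions and exception locators. With your code, you can do this most easily with: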

line_contents_expr.runTests([sample1, sample2])

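But I also suggest you step back for a moment and think about what this {upto n words} business is all about. Look at your samples and draw rectangles around the line terms, then within the line terms draw circles around the phrase terms. (This would be a good exercise leading into writing a BNF description of this grammar for yourself, which I always recommend as a step in getting your head around the problem.) What if you treated the upto expressions as another operator? To see this, change phrase_term back to the way you had it: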

phrase_term = Group(OneOrMore(phrase_word))

and then change the first precedence entry in the definition of phrase_expr to:

    ((BEFORE | AFTER | JOIN | upto_N_words), 2, opAssoc.LEFT,),

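or give some thought to whether the upto operator should have a higher or lower precedence than BEFORE, AFTER, and JOIN, and adjust the precedence list accordingly.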
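As a rough sketch of that second option, assuming the upto operator should bind more tightly than BEFORE, AFTER and JOIN, the precedence list might be arranged as below. The grammar is cut down from the one in the question (no NOT/AND/OR and no line directives) purely to keep the illustration short.

```python
from pyparsing import (CaselessKeyword, Group, MatchFirst, OneOrMore, Suppress,
                       Word, alphanums, infixNotation, opAssoc, pyparsing_common)

BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())
keyword = MatchFirst([BEFORE, AFTER, JOIN])
lpar, rpar = Suppress('{'), Suppress('}')

phrase_word = ~keyword + Word(alphanums + '_')
upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords')
                     + 'words' + rpar)
phrase_term = Group(OneOrMore(phrase_word))

# Entries listed first bind most tightly, so upto groups its operands
# before BEFORE/AFTER/JOIN are applied.
phrase_expr = infixNotation(phrase_term,
                            [
                             (upto_N_words, 2, opAssoc.LEFT),
                             ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT),
                            ],
                            lpar=lpar, rpar=rpar)

result = phrase_expr.parseString("abcd {upto 4 words} xyzw BEFORE something else")
print(result.dump())

# The upto operator is the middle token of the inner group, and its
# results name is read from that token.
upto_group = result[0][0]
print(upto_group[1].numberofwords)
```

With this ordering the upto chain becomes a single operand of BEFORE instead of sitting at the same level, which may make the later regex generation simpler.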

With this change, I get the following output from calling runTests on your samples:

LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we

[[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]
[0]:
  [['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]
  [0]:
    ['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]]
    - line_directive: 'LINE_CONTAINS'
    - phrase: [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]
      [0]:
        [['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]
        [0]:
          ['phrase', 'one']
        [1]:
          BEFORE
        [2]:
          [['phrase2'], 'AND', ['phrase3']]
          [0]:
            ['phrase2']
          [1]:
            AND
          [2]:
            ['phrase3']
  [1]:
    AND
  [2]:
    ['LINE_STARTSWITH', [['Therefore', 'we']]]
    - line_directive: 'LINE_STARTSWITH'
    - phrase: [['Therefore', 'we']]
      [0]:
        ['Therefore', 'we']



LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else

[['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]]
[0]:
  ['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]
  - line_directive: 'LINE_CONTAINS'
  - phrase: [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]
    [0]:
      [['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]
      [0]:
        ['abcd']
      [1]:
        ['upto', 4, 'words']
        - numberofwords: 4
      [2]:
        ['xyzw']
      [3]:
        ['upto', 3, 'words']
        - numberofwords: 3
      [4]:
        ['pqrs']
      [5]:
        BEFORE
      [6]:
        ['something', 'else']

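You can iterate over these results and pick them apart, but you are quickly reaching the point where you should look at building executable nodes from the different precedence levels - see the SimpleBool.py example on the pyparsing wiki for how to do this.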
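To give an idea of what picking the results apart by hand might look like, here is a minimal sketch that walks the nested list from the second dump above (hard-coded here, as asList() would return it). The function names and the regex fragments chosen for upto and BEFORE are assumptions made for illustration, and AND, OR, NOT, AFTER and JOIN are deliberately left unhandled.

```python
import re

def is_upto(token):
    # An upto operator appears in the token list as ['upto', N, 'words'].
    return isinstance(token, list) and len(token) == 3 and token[0] == 'upto'

def phrase_to_regex(node):
    """Recursively turn a nested token list into a regex string."""
    # A plain phrase is a list of words, e.g. ['something', 'else'].
    if all(isinstance(t, str) for t in node):
        return r"\s+".join(re.escape(t) for t in node)
    # Otherwise the list alternates operand, operator, operand, ...
    regex = phrase_to_regex(node[0])
    for op, operand in zip(node[1::2], node[2::2]):
        if is_upto(op):
            regex += r"(?:\s+\S+){0,%d}\s+" % op[1]
        elif op == 'BEFORE':
            regex += ".*"
        else:
            raise ValueError("unsupported operator: %r" % (op,))
        regex += phrase_to_regex(operand)
    return regex

tokens = [['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'],
          ['pqrs'], 'BEFORE', ['something', 'else']]
pattern = phrase_to_regex(tokens)
print(pattern)
print(bool(re.search(pattern, "abcd one two xyzw pqrs and then something else")))
```

Every new operator needs another branch in the loop, which is the kind of growth that the node-class approach is meant to avoid.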

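EDIT: Please review this pared-down version of a parser for phrase_expr, and how it creates Node instances that themselves generate the output. See how numberofwords is accessed on the operator in the UpToNode class. See how "xyz abc" gets interpreted as "xyz AND abc" with an implicit AND operator.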

from pyparsing import *
import re

UPTO, WORDS, AND, OR = map(CaselessKeyword, "upto words and or".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()

word = ~keyword + Word(alphas)
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)

class Node(object):
    def __init__(self, tokens):
        self.tokens = tokens

    def generate(self):
        pass

class LiteralNode(Node):
    def generate(self):
        return "(%s)" % re.escape(self.tokens[0])
    def __repr__(self):
        return repr(self.tokens[0])

class AndNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '.*'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])

class OrNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '|'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])

class UpToNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        ret = tokens[0].generate()
        word_re = r"\s+\S+"
        space_re = r"\s+"
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
        return ret

    def __repr__(self):
        tokens = self.tokens[0]
        ret = repr(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
        return ret

IMPLICIT_AND = Empty().setParseAction(replaceWith("AND"))

phrase_expr = infixNotation(word.setParseAction(LiteralNode),
        [
        (upto_expr, 2, opAssoc.LEFT, UpToNode),
        (AND | IMPLICIT_AND, 2, opAssoc.LEFT, AndNode),
        (OR, 2, opAssoc.LEFT, OrNode),
        ])

tests = """\
        xyz
        xyz abc
        xyz {upto 4 words} def""".splitlines()

for t in tests:
    t = t.strip()
    if not t:
        continue
    print(t)
    try:
        parsed = phrase_expr.parseString(t)
    except ParseException as pe:
        print(' '*pe.loc + '^')
        print(pe)
        continue
    print(parsed)
    print(parsed[0].generate())
    print()

prints:

xyz
['xyz']
(xyz)

xyz abc
['xyz' AND 'abc']
(xyz).*(abc)

xyz {upto 4 words} def
['xyz' {0-4 WORDS} 'def']
(xyz)((\s+\S+){0,4}\s+)(def)

Expand on this to support your LINE_xxx expressions.
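One possible direction for that extension, sketched here in plain Python without the parser, is to wrap each generated phrase regex according to its line directive and require every rule to match. The template table and both function names are hypothetical, and only AND between line rules is covered.

```python
import re

# Hypothetical mapping from line directives to regex templates. The
# directive names come from the question; the templates are an assumption.
DIRECTIVE_TEMPLATES = {
    "LINE_CONTAINS": "{0}",
    "LINE_STARTSWITH": r"^\s*{0}",
    "LINE_ENDSWITH": r"{0}\s*$",
}

def line_rule_to_regex(directive, phrase_regex):
    # Wrap the phrase regex in a non-capturing group so that any '|' inside
    # it cannot leak past the anchor added by the template.
    return DIRECTIVE_TEMPLATES[directive.upper()].format("(?:%s)" % phrase_regex)

def line_matches(rules, line):
    # Rules joined by AND: every (directive, phrase_regex) pair must match.
    return all(re.search(line_rule_to_regex(d, p), line) for d, p in rules)

rules = [
    ("LINE_CONTAINS", r"(xyz)((\s+\S+){0,4}\s+)(def)"),
    ("LINE_STARTSWITH", r"Therefore\s+we"),
]
print(line_matches(rules, "Therefore we see xyz a b def"))
print(line_matches(rules, "We see xyz def"))
```

In a fuller version, each line directive would more naturally become its own Node subclass with a generate() method, in the same style as the classes above.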
