On 2007-11-04, Kay Schluehr <[EMAIL PROTECTED]> wrote:
> On 4 Nov., 03:07, Neil Cerutti <[EMAIL PROTECTED]> wrote:
>> I wouldn't characterize it as pretending. How would you parse:
>>
>>   hello end hello end
>>
>> "WORD END WORD END" and "WORD WORD WORD END" are both valid
>> interpretations, according to the grammar.
>>
>> As soon as you remove the ambiguity from the grammar, PyParsing
>> starts to work correctly.
>
> I think you are correct about this. But I'm not sure how much it
> should matter. Just take a look at Python's Grammar:
>
> http://svn.python.org/view/python/trunk/Grammar/Grammar?rev=55446&view=markup
>
> Without special keyword treatment this grammar would be
> ambiguous and couldn't be parsed using an LL(1) parser.
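To make that concrete, here's what the naive grammar does in
pyparsing. (A quick sketch; the exact exception text may vary
between pyparsing versions.)

>>> from pyparsing import Literal, OneOrMore, Word, alphas
>>> goal = OneOrMore(Word(alphas)) + Literal('end')
>>> goal.parseString('hello end')
Traceback (most recent call last):
  ...
ParseException: Expected "end" (at char 9), (line:1, col:10)

Word(alphas) happily matches 'end' as just another word, so
OneOrMore greedily consumes it and nothing is left over for the
trailing Literal('end').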
I agree. I don't know how easy it is to create keywords using
PyParsing, or if the grammar in question would still be
considered correct by the author.

> The "grammar compiler" which builds the parser tables creates a
> special label for each keyword. This label is filtered when a
> NAME token is fed into the parser. With the label that belongs
> to e.g. 'if' or 'while' the correct statement can be selected
> in constant time. The same happens when I use the parser
> generator with your EBNF grammar. With a little more adaptation
> NUMBER tokens could be filtered as well. But this would be
> overdesign.
>
> Theoretical beauty is compromised here using reasonable default
> assumptions for keeping the grammar simple ( "convention over
> configuration" to borrow a Rails slogan ).

Keywords are practically ubiquitous. I never thought of them as
unbeautiful before. (A toy sketch of the keyword filtering you
describe appears at the end of this message.)

> Tokenization is another issue in Python. It is indeed somewhat
> special due to significant whitespace and line continuation,
> but tokenization is conceptually much simpler, and one would
> likely throw all kinds of options and special-case handling
> into the lexical analysis phase.

It might be a quick fix in PyParsing, which includes a Keyword
type, but without the semantics that are needed in this case.
You have to use (as suggested earlier) negative lookahead,
either in a Regex or with the NotAny type:

>>> goal = OneOrMore(NotAny(Literal('end')) + Word(alphas)) + Literal('end')
>>> goal.parseString('hello hello hello end')
(['hello', 'hello', 'hello', 'end'], {})
>>> goal.parseString('hello end hello end')
(['hello', 'end'], {})

No scanner/tokenizer needed! ;)
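For completeness, here's the gist of the keyword filtering Kay
describes, as I understand it. This is only a toy sketch of the
idea, not CPython's actual pgen machinery, and the names in it
are made up:

    # The tokenizer emits NAME for every identifier. Before a
    # token reaches the parse tables, NAME tokens are filtered
    # through a table with one special label per keyword, so a
    # keyword selects its own label in constant time.
    NAME = 'NAME'
    KEYWORD_LABELS = {'end': 'END'}

    def classify(tok_type, tok_value):
        if tok_type == NAME:
            return KEYWORD_LABELS.get(tok_value, NAME)
        return tok_type

    print(classify(NAME, 'hello'))  # -> NAME (an ordinary word)
    print(classify(NAME, 'end'))    # -> END  (the keyword label)

With a filter like that in front of the parse tables, 'end' can
never be mistaken for a plain WORD, which is exactly what removes
the ambiguity.

-- 
Neil Cerutti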