Chris,

Thanks for your quick answer. That changes a lot of stuff, and now I'm able to do my parsing as I intended to.
Paul,

Thanks for your detailed explanation. One of the things I think is missing from the documentation (or that I couldn't find easily) is the kind of explanation you give about 'The Way of PyParsing'. For example, it took me a while to understand that I could easily implement simple recursions using OneOrMore(Group()). Or maybe things were out there and I didn't search enough...

Still, fwiw, congratulations on the library. PyParsing allowed me to do in just a couple of hours, including learning about its API (minus this little inconvenience), what would have taken me a couple of days with, for example, ANTLR (in fact, I've already put aside ANTLR more than once in the past for a built-from-scratch parser).

Cheers,

Hugo Ferreira

On 11/22/06, Paul McGuire <[EMAIL PROTECTED]> wrote:
"Bytter" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> Hi,
>
> I'm trying to construct a parser, but I'm stuck with some basic
> stuff... For example, I want to match the following:
>
> letter = "A"..."Z" | "a"..."z"
> literal = letter+
> include_bool := "+" | "-"
> term = [include_bool] literal
>
> So I defined this as:
>
> literal = Word(alphas)
> include_bool = Optional(oneOf("+ -"))
> term = include_bool + literal
>
> The problem is that:
>
> term.parseString("+a") -> (['+', 'a'], {}) # OK
> term.parseString("+ a") -> (['+', 'a'], {}) # KO. It shouldn't
> recognize any token since I didn't said the SPACE was allowed between
> include_bool and literal.
>

As Chris pointed out in his post, the most direct way to fix this is to use Combine. Note that Combine does two things: it requires the expressions to be adjacent, and it combines the results into a single token.

For instance, when defining the expression for a real number, something like:

realnum = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)

Pyparsing would parse "3.14159" into the separate tokens ['3', '.', '14159'] (an Optional that does not match contributes no token). For this grammar, pyparsing would also accept "2. 23" as ['2', '.', '23'], even though there is a space between the decimal point and "23". But by wrapping it inside Combine, as in:

realnum = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))

we accomplish two things: pyparsing only matches if all the elements are adjacent, with no whitespace or comments; and the matched token is returned as ['3.14159']. (Yes, I left off scientific notation, but it is an extension of the same issue.)

Pyparsing in general does implicit whitespace skipping; it is part of the zen of pyparsing, and distinguishes it from conventional regexps (although I think there is a new '?' switch for re's that puts '\s*'s between re terms for you).
This is to simplify the grammar definition, so that it doesn't need to be littered with "optional whitespace or comments could go here" expressions; instead, whitespace and comments (or "ignorables" in pyparsing terminology) are parsed over before every grammar expression.

I instituted this out of recoil from a previous project, in which a co-developer implemented a boolean parser by first tokenizing by whitespace, then parsing out the tokens. Unfortunately, this meant that "color=='blue' && size=='medium'" would not parse successfully, instead requiring "color == 'blue' && size == 'medium'". It doesn't seem like much, but our support guys got many calls asking why the boolean clauses weren't matching. I decided that when I wrote a parser, "y=m*x+b" would be just as parseable as "y = m * x + b".

For that matter, you'd be surprised where whitespace and comments sneak into people's source code: spaces after left parentheses and comments after semicolons, for example, are easily forgotten when spec'ing out the syntax for a C "for" statement; whitespace inside HTML tags is another unanticipated surprise.

So looking at your grammar, you say you don't want to have this be a successful parse:

term.parseString("+ a") -> (['+', 'a'], {})

because, "It shouldn't recognize any token since I didn't said the SPACE was allowed between include_bool and literal."

In fact, pyparsing allows spaces by default; that's why the given parse succeeds. I would turn this question around, and ask you in terms of your grammar: what SHOULD be allowed between include_bool and literal? If spaces are not a problem, then your grammar as-is is sufficient. If spaces are absolutely verboten, then there are 2 or 3 different techniques in pyparsing to disable the whitespace-skipping behavior, depending on whether you want all whitespace skipping disabled, just for literals of a certain type, or just for literals when following a leading include_bool sign.
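As a rough sketch of two of those techniques, here is how Combine and leaveWhitespace() behave on the grammars discussed above. This assumes a pyparsing version that still accepts the classic camelCase names used in this thread (parseString, oneOf, leaveWhitespace); the helper matches() and the variable names are made up for illustration, and other approaches may suit a real grammar better.

```python
from pyparsing import Combine, Optional, ParseException, Word, alphas, nums, oneOf


def matches(expr, text):
    """Hypothetical helper: return the tokens as a list, or None if expr rejects text."""
    try:
        # parseAll=True makes the parse fail unless the whole string is consumed.
        return expr.parseString(text, parseAll=True).asList()
    except ParseException:
        return None


# Default behaviour: whitespace is skipped before every sub-expression.
loose_term = Optional(oneOf("+ -")) + Word(alphas)

# Technique 1: Combine requires the pieces to be adjacent and
# merges them into a single token.
combined_term = Combine(Optional(oneOf("+ -")) + Word(alphas))

# Technique 2: leaveWhitespace() on the literal only. The sign and the
# literal stay separate tokens, but no whitespace may precede the literal.
strict_term = Optional(oneOf("+ -")) + Word(alphas).leaveWhitespace()

# The real-number expression from the message, without and with Combine.
loose_real = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)
strict_real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))

if __name__ == "__main__":
    for name, expr, text in [
        ("loose_term", loose_term, "+ a"),
        ("combined_term", combined_term, "+a"),
        ("combined_term", combined_term, "+ a"),
        ("strict_term", strict_term, "+a"),
        ("strict_term", strict_term, "+ a"),
        ("loose_real", loose_real, "2. 23"),
        ("strict_real", strict_real, "3.14159"),
        ("strict_real", strict_real, "2. 23"),
    ]:
        print(name, repr(text), "->", matches(expr, text))
```

Which variant to pick depends on whether you want the sign and the literal back as one token (Combine) or as two (leaveWhitespace on the literal).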
Thanks for giving pyparsing a try; if you want further help, you can post here, or on the pyparsing wiki - the discussion threads on the Home page are a pretty good support and message log.

-- Paul

-- http://mail.python.org/mailman/listinfo/python-list
-- GPG Fingerprint: B0D7 1249 447D F5BB 22C5 5B9B 078C 2615 504B 7B85