On Aug 26, 8:05 pm, "Ryan Ginstrom" <[EMAIL PROTECTED]> wrote:
> > On Behalf Of Jason Evans
> > Parsers typically deal with tokens rather than individual characters,
> > so the scanner that creates the tokens is the main thing that Unicode
> > matters to. I have written Unicode-aware scanners for use with
> > pyparsing-based parsers, with no problems. This is pretty easy to do,
> > since Python has built-in support for Unicode strings.
>
> The only caveat being that since Chinese and Japanese scripts don't
> typically delimit "words" with spaces, I think you'd have to pass the
> text through a tokenizer (like ChaSen for Japanese) before using
> PyParsing.
>
> Regards,
> Ryan Ginstrom
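[A minimal sketch, not from the thread, of the Unicode-scanner point quoted above: a pyparsing token class is just a set of allowed characters, so it can be built directly from a Unicode range. The hiragana-only range here is illustrative; a real Japanese scanner would also cover katakana and kanji, and still would not solve word segmentation.]

```python
from pyparsing import Word

# Hiragana block U+3041..U+3096 (illustrative slice of Japanese script).
HIRAGANA = "".join(chr(c) for c in range(0x3041, 0x3097))
hira_word = Word(HIRAGANA)  # matches a maximal run of hiragana

tokens = hira_word.parseString("ひらがな")
print(tokens[0])  # ひらがな
```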
Did you think pyparsing so mundane as to require spaces between tokens?
Pyparsing has been doing this type of token recognition since Day 1;
looking for tokens without delimiting spaces was one of the first
applications for pyparsing, and the issue is not unique to Chinese or
Japanese text.

Pyparsing will easily find the tokens in this string:

    y=a*x**2+b*x+c

as:

    ['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']

even though there is not a single delimiting space. Pyparsing will also
render this as a nested parse tree, reflecting the precedence of
operations:

    ['y', '=', [['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+', 'c']]

and will allow you to access individual tokens by field name:

- lhs: y
- rhs: [['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+', 'c']

Please feel free to look through the posted examples on the pyparsing
wiki at http://pyparsing.wikispaces.com/Examples, or some of the
applications currently using pyparsing at
http://pyparsing.wikispaces.com/WhosUsingPyparsing, to get a better feel
for the kinds of tasks pyparsing is capable of.

-- Paul
--
http://mail.python.org/mailman/listinfo/python-list
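[A minimal sketch, not Paul's actual grammar, of how such an expression can be parsed with pyparsing's infixNotation helper. The lhs/rhs field names follow the post; the nested tree it produces matches the shape shown above, except that "2" stays a string here since no conversion parse action is attached.]

```python
from pyparsing import Word, alphas, nums, infixNotation, opAssoc, oneOf

integer = Word(nums)
variable = Word(alphas)
operand = integer | variable

# Precedence levels, tightest binding first: ** is right-associative,
# then * and /, then + and -.
expr = infixNotation(operand, [
    ("**", 2, opAssoc.RIGHT),
    (oneOf("* /"), 2, opAssoc.LEFT),
    (oneOf("+ -"), 2, opAssoc.LEFT),
])

equation = variable("lhs") + "=" + expr("rhs")

# No delimiting spaces anywhere in the input.
result = equation.parseString("y=a*x**2+b*x+c")
print(result["lhs"])   # y
print(result.asList())
```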