Re: Regular Expressions

skip Mon, 12 Feb 2007 09:18:26 -0800

    dbl> The source of HTMLParser and xmllib use regular expressions for
    dbl> parsing out the data. htmllib calls sgmllib at the beginning of its
    dbl> code--sgmllib starts off with a bunch of regular expressions used
    dbl> to parse data.


I am almost certain those modules use regular expressions for lexical
analysis (splitting the input byte stream into "words"), not for parsing
(extracting the structure of the "sentences").

If I have a simple expression:

    (7 + 3.14) * CONST

that's just a stream of bytes, "(", "7", " ", "+", ...  Lexical analysis
chunks that stream of bytes into the "words" of the language:

    LPAREN (NUMBER, 7) PLUS (NUMBER, 3.14) RPAREN TIMES (IDENT, "CONST")
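
To make that concrete, here is a rough sketch of a regular-expression-based
lexer for the toy expression above.  The token names mirror the ones in my
example; the TOKEN_SPEC table, the tokenize() function and the choice to
pair punctuation tokens with None are my own assumptions for illustration,
not how sgmllib or HTMLParser actually do it.

```python
import re

# Hypothetical token table for the toy expression language (an assumption
# made for this sketch, not taken from any stdlib module).
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),   # integer or decimal literal
    ("IDENT",  r"[A-Za-z_]\w*"),    # identifiers such as CONST
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),             # whitespace, dropped below
    ("ERROR",  r"."),               # any other single character
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Split text into a flat list of (kind, value) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup      # name of the alternative that matched
        value = match.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError("unexpected character %r" % value)
        if kind == "NUMBER":
            value = float(value) if "." in value else int(value)
        elif kind != "IDENT":
            value = None            # punctuation carries no payload
        tokens.append((kind, value))
    return tokens


if __name__ == "__main__":
    print(tokenize("(7 + 3.14) * CONST"))
```

Note that the output is still flat: nothing in it says which operands belong
to which operator.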

Parsing then constructs a higher level representation of that stream of
"words" (more commonly called tokens or lexemes).  That representation is
application-dependent.
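
As an illustration only, here is a minimal recursive-descent parser for that
token stream (Python 3, since it uses nonlocal).  The grammar, the parse()
function and the nested-tuple tree are assumptions I picked for the sketch;
a real application would choose whatever representation suits it.

```python
# Assumed grammar for the sketch:
#   expr   := term (PLUS term)*
#   term   := factor (TIMES factor)*
#   factor := NUMBER | IDENT | LPAREN expr RPAREN

def parse(tokens):
    """Turn a flat list of (kind, value) tokens into a nested tuple tree."""
    pos = 0

    def peek():
        # Kind of the next token, or None at end of input.
        return tokens[pos][0] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        kind = peek()
        if kind in ("NUMBER", "IDENT"):
            return advance()            # leaf node: the token itself
        if kind == "LPAREN":
            advance()
            node = expr()               # recursion handles nesting
            if peek() != "RPAREN":
                raise SyntaxError("expected ')'")
            advance()
            return node
        raise SyntaxError("unexpected token: %r" % (kind,))

    def term():
        node = factor()
        while peek() == "TIMES":
            advance()
            node = ("TIMES", node, factor())
        return node

    def expr():
        node = term()
        while peek() == "PLUS":
            advance()
            node = ("PLUS", node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after expression")
    return tree


if __name__ == "__main__":
    toks = [("LPAREN", None), ("NUMBER", 7), ("PLUS", None),
            ("NUMBER", 3.14), ("RPAREN", None), ("TIMES", None),
            ("IDENT", "CONST")]
    print(parse(toks))
```

The recursive call inside factor() is the part a single regular expression
has trouble expressing: parentheses can nest to any depth.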

Regular expressions are ideal for lexical analysis.  They are not-so-hot for
parsing unless the grammar of the language being parsed is *extremely*
simple.

Here are a couple of much better expositions on these topics:

    http://en.wikipedia.org/wiki/Lexical_analysis
    http://en.wikipedia.org/wiki/Parsing

Skip

