dbl> The source of HTMLParser and xmllib use regular expressions for
dbl> parsing out the data. htmllib calls sgmllib at the beginning of its
dbl> code--sgmllib starts off with a bunch of regular expressions used
dbl> to parse data.
I am almost certain those modules use regular expressions for lexical
analysis (splitting the input byte stream into "words"), not for parsing
(extracting the structure of the "sentences"). If I have a simple
expression:

    (7 + 3.14) * CONST

that's just a stream of bytes, "(", "7", " ", "+", ... Lexical analysis
chunks that stream of bytes into the "words" of the language:

    LPAREN (NUMBER, 7) PLUS (NUMBER, 3.14) RPAREN TIMES (IDENT, "CONST")

Parsing then constructs a higher-level representation of that stream of
"words" (more commonly called tokens or lexemes). That representation is
application-dependent.

Regular expressions are ideal for lexical analysis. They are not so hot
for parsing unless the grammar of the language being parsed is
*extremely* simple, because a regular expression cannot keep track of
arbitrarily deep nesting such as balanced parentheses.

Here are a couple of much better expositions on the topics:

    http://en.wikipedia.org/wiki/Lexical_analysis
    http://en.wikipedia.org/wiki/Parsing

Skip
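
P.S. Here is a rough sketch of the split for the expression above. It is
not how HTMLParser or sgmllib are written; the token names, the
tokenize() and parse() functions, and the nested-tuple tree are all made
up for illustration. The regular expression only chops the text into
tokens. The nesting is handled by an ordinary recursive function, which
is the part a regular expression alone can't do.

```python
import re

# Lexical analysis: one regular expression, one named group per token kind.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<PLUS>\+)
  | (?P<TIMES>\*)
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Chunk the character stream into (kind, text) tokens, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character %r at %d" % (text[pos], pos))
        pos = m.end()
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
    return tokens


def parse(tokens):
    """Parsing: build a nested tuple tree with a tiny recursive-descent parser.

    Hypothetical grammar, just enough for the example:
        expr   := term (PLUS term)*
        term   := factor (TIMES factor)*
        factor := NUMBER | IDENT | LPAREN expr RPAREN
    """
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError("expected %s, got %s" % (kind, peek()))
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        if peek() == "LPAREN":
            take("LPAREN")
            node = expr()      # recursion is what handles the nesting
            take("RPAREN")
            return node
        if peek() == "NUMBER":
            return take("NUMBER")
        return take("IDENT")

    def term():
        node = factor()
        while peek() == "TIMES":
            take("TIMES")
            node = ("TIMES", node, factor())
        return node

    def expr():
        node = term()
        while peek() == "PLUS":
            take("PLUS")
            node = ("PLUS", node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError("unexpected trailing token %s" % peek())
    return tree


tokens = tokenize("(7 + 3.14) * CONST")
tree = parse(tokens)
```

tokenize() gives back the flat list of "words"; parse() turns that list
into a tree in which the addition sits underneath the multiplication.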