Thomas Ploch wrote:
> Marc 'BlackJack' Rintsch wrote:
>
>>In <[EMAIL PROTECTED]>, Thomas Ploch
>>wrote:
>>>Alright, my prof said '... to process documents written in structural
>>>markup languages using regular expressions is a no-no.'

Very true. HTML is LALR(0), that is, you can parse it without
looking ahead. Parsers for LALR(0) languages are easy, and work by
repeatedly getting the next character and using that to drive a single
state machine. The first character-level parser yields tokens, which
are then processed by a grammar-level parser. Any compiler book will
cover this.

Using regular expressions for LALR(0) parsing is a vice inherited
from Perl, in which regular expressions are easy and "get next
character from string" is unreasonably expensive. In Python, at least
you can index through a string.

                                John Nagle
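
For illustration, here is a minimal sketch of the character-level stage
described above: a single state machine that takes one character at a
time, never looks ahead, and yields tokens for a grammar-level parser to
consume. The function name `tokenize` and the token kinds TEXT, OPEN,
CLOSE and ERROR are made up for this example, and it assumes a heavily
simplified markup with only plain text and <tag> / </tag> forms (no
attributes, comments, entities or quoting). It is not a real HTML
tokenizer.

```python
def tokenize(text):
    """Yield (kind, value) tokens from simplified HTML-like markup.

    Assumption: only plain text and <tag> / </tag> forms are handled.
    """
    TEXT, TAG = "text", "tag"   # the two states of the machine
    state = TEXT
    buf = []                    # characters collected for the current token

    for ch in text:             # get the next character; no lookahead
        if state == TEXT:
            if ch == "<":
                if buf:
                    yield ("TEXT", "".join(buf))
                buf = []
                state = TAG
            else:
                buf.append(ch)
        else:                   # state == TAG
            if ch == ">":
                name = "".join(buf)
                if name.startswith("/"):
                    yield ("CLOSE", name[1:])
                else:
                    yield ("OPEN", name)
                buf = []
                state = TEXT
            else:
                buf.append(ch)

    # Flush whatever is left: trailing text, or the partial contents
    # of a tag that was never closed with ">".
    if buf:
        yield ("TEXT" if state == TEXT else "ERROR", "".join(buf))


if __name__ == "__main__":
    for token in tokenize("<p>Hi <b>there</b></p>"):
        print(token)
```

A grammar-level parser would then work on this token stream, for
example by pushing OPEN tokens on a stack and popping them on the
matching CLOSE.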