Re: Question concerning this list [WebCrawler]

John Nagle Sun, 31 Dec 2006 09:51:01 -0800

Thomas Ploch wrote:
> Marc 'BlackJack' Rintsch wrote:
> 
>>In <[EMAIL PROTECTED]>, Thomas Ploch
>>wrote:


>>>Alright, my prof said '... to process documents written in structural
>>>markup languages using regular expressions is a no-no.'

    Very true.  HTML is LALR(0), that is, you can parse it without
looking ahead.  Parsers for LALR(0) languages are easy, and
work by repeatedly getting the next character and using it to
drive a single state machine.  The first, character-level parser
(the lexer) yields tokens, which are then processed by a grammar-level parser.
Any compiler book will cover this.
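
    A minimal sketch of the two-stage approach described above, written
for illustration only.  The names tokenize and check_nesting are
hypothetical, and the input language is a deliberately tiny subset of
HTML (no attributes, comments, entities, or script/style handling), so
this is not a usable HTML parser:

```python
def tokenize(text):
    """Character-level pass: yield ('text', s), ('open', name) and
    ('close', name) tokens from a very simplified HTML-like input.

    Assumption: tags carry no attributes and '<' never appears in text.
    An unterminated tag at end of input is silently dropped.
    """
    TEXT, TAG = 0, 1          # the two states of the machine
    state = TEXT
    buf = []
    for ch in text:           # one character at a time, no lookahead
        if state == TEXT:
            if ch == "<":
                if buf:
                    yield ("text", "".join(buf))
                buf = []
                state = TAG
            else:
                buf.append(ch)
        else:                 # state == TAG
            if ch == ">":
                name = "".join(buf).strip()
                if name.startswith("/"):
                    yield ("close", name[1:].strip().lower())
                else:
                    yield ("open", name.lower())
                buf = []
                state = TEXT
            else:
                buf.append(ch)
    if state == TEXT and buf:
        yield ("text", "".join(buf))


def check_nesting(tokens):
    """Grammar-level pass: return True if open/close tags balance."""
    stack = []
    for kind, value in tokens:
        if kind == "open":
            stack.append(value)
        elif kind == "close":
            if not stack or stack.pop() != value:
                return False
    return not stack
```

    For example, list(tokenize("<p>Hi</p>")) gives an 'open' token, a
'text' token and a 'close' token, and check_nesting() accepts that
stream but rejects one produced from "<p><b>x</p></b>".  For real
documents, a library parser is the safer choice.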

    Using regular expressions for LALR(0) parsing is a vice inherited
from Perl, in which regular expressions are easy and "get next
character from string" is unreasonably expensive.  In Python, at least
you can index through a string.

                                John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list
