Re: Question concerning this list [WebCrawler]

2007-01-01 Thread Diez B. Roggisch
Thomas Ploch wrote: > John Nagle wrote: >> Very true. HTML is LALR(0), that is, you can parse it without >> looking ahead. Parsers for LALR(0) languages are easy, and >> work by repeatedly getting the next character and using that to >> drive a single state machine. The first character-l

Re: Question concerning this list [WebCrawler]

2006-12-31 Thread Thomas Ploch
John Nagle wrote: > > Very true. HTML is LALR(0), that is, you can parse it without > looking ahead. Parsers for LALR(0) languages are easy, and > work by repeatedly getting the next character and using that to > drive a single state machine. The first character-level parser > yields toke

Re: Question concerning this list [WebCrawler]

2006-12-31 Thread John Nagle
Thomas Ploch wrote: > Marc 'BlackJack' Rintsch wrote: > >>In <[EMAIL PROTECTED]>, Thomas Ploch >>wrote: >>>Alright, my prof said '... to process documents written in structural >>>markup languages using regular expressions is a no-no.' Very true. HTML is LALR(0), that is, you can parse it
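
A rough sketch of the "get the next character and drive a single state machine" idea described in this message. It is not code from the thread: the function name `tokenize` and the two states are assumptions made for illustration, and the sketch deliberately ignores comments, quoted attribute values that contain `>`, entities, and everything else a real HTML tokenizer has to handle.

```python
def tokenize(text):
    """Split markup into ("tag", ...) and ("text", ...) tokens.

    Hypothetical illustration only: one character is read at a time and
    no lookahead is used.
    """
    TEXT, TAG = "text", "tag"  # the two states of this toy machine
    state = TEXT
    buf = []
    tokens = []
    for ch in text:
        if state == TEXT:
            if ch == "<":
                # A tag starts: emit any text collected so far.
                if buf:
                    tokens.append((TEXT, "".join(buf)))
                buf = []
                state = TAG
            else:
                buf.append(ch)
        else:  # state == TAG
            if ch == ">":
                # The tag ends: emit its contents without the brackets.
                tokens.append((TAG, "".join(buf)))
                buf = []
                state = TEXT
            else:
                buf.append(ch)
    if state == TAG:
        # Unterminated tag: keep it as plain text (an arbitrary choice).
        tokens.append((TEXT, "<" + "".join(buf)))
    elif buf:
        tokens.append((TEXT, "".join(buf)))
    return tokens


print(tokenize("<p>Hi <b>there</b></p>"))
```

The token stream produced this way would then be fed to a second stage that tracks nesting; that stage is not shown here.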

Re: Question concerning this list [WebCrawler]

2006-12-31 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, Thomas Ploch wrote: > This is how my regexes look like: > > import re > > class Tags: > def __init__(self, sourceText): > self.source = sourceText > self.curPos = 0 > self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*" > self.tagPattern = re
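
For readers who want to try the quoted `namePattern`, here is a minimal sketch of one way it could be used. Only the pattern string comes from the quoted message; `OPEN_TAG` and `open_tag_names` are hypothetical names, and this is not a reconstruction of the truncated `tagPattern`.

```python
import re

# Name pattern as quoted in the message above.
NAME_PATTERN = "[A-Za-z_][A-Za-z0-9_.:-]*"

# Hypothetical helper pattern: "<" followed directly by a name.
# Closing tags ("</...") do not match because "/" cannot start a name.
OPEN_TAG = re.compile(r"<(" + NAME_PATTERN + r")")


def open_tag_names(source):
    """Return the names of all opening tags, in document order."""
    return OPEN_TAG.findall(source)


print(open_tag_names('<html><body class="x"><xsl:if>text</xsl:if></body>'))
```

This only lists names. It does not track which element is nested inside which, and it will also match text that looks like a tag inside comments or attribute values, which is the kind of limitation the professor's warning about regular expressions points at.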

Re: Question concerning this list [WebCrawler]

2006-12-31 Thread Thomas Ploch
Marc 'BlackJack' Rintsch wrote: > In <[EMAIL PROTECTED]>, Thomas Ploch > wrote: > >> Alright, my prof said '... to process documents written in structural >> markup languages using regular expressions is a no-no.' (Because of >> nested Elements? Can't remember) So I think he wants us to use rege