Re: Question concerning this list [WebCrawler]

2007-01-01 Thread Diez B. Roggisch
Thomas Ploch wrote: > John Nagle wrote: >> Very true. HTML is LALR(0), that is, you can parse it without >> looking ahead. Parsers for LALR(0) languages are easy, and >> work by repeatedly getting the next character and using that to >> drive a single state machine. The first character-l

Re: Question concerning this list [WebCrawler]

2006-12-31 Thread Thomas Ploch
John Nagle wrote: > > Very true. HTML is LALR(0), that is, you can parse it without > looking ahead. Parsers for LALR(0) languages are easy, and > work by repeatedly getting the next character and using that to > drive a single state machine. The first character-level parser > yields toke

Re: Question concerning this list [WebCrawler]

2006-12-31 Thread John Nagle
Thomas Ploch wrote: > Marc 'BlackJack' Rintsch wrote: > >>In <[EMAIL PROTECTED]>, Thomas Ploch >>wrote: >>>Alright, my prof said '... to process documents written in structural >>>markup languages using regular expressions is a no-no.' Very true. HTML is LALR(0), that is, you can parse it

Re: WebCrawler (was: 'Question concerning this list')

2006-12-31 Thread Thomas Ploch
Marc 'BlackJack' Rintsch wrote: > In <[EMAIL PROTECTED]>, Thomas Ploch > wrote: > >> This is how my regexes look like: >> >> import re >> >> class Tags: >> def __init__(self, sourceText): >> self.source = sourceText >> self.curPos = 0 >> self.namePattern = "[A-Za-z_][

Re: Question concerning this list [WebCrawler]

2006-12-31 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, Thomas Ploch wrote: > This is how my regexes look like: > > import re > > class Tags: > def __init__(self, sourceText): > self.source = sourceText > self.curPos = 0 > self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*" > self.tagPattern = re

Re: Question concerning this list [WebCrawler]

2006-12-31 Thread Thomas Ploch
Marc 'BlackJack' Rintsch wrote: > In <[EMAIL PROTECTED]>, Thomas Ploch > wrote: > >> Alright, my prof said '... to process documents written in structural >> markup languages using regular expressions is a no-no.' (Because of >> nested Elements? Can't remember) So I think he wants us to use rege

Re: Question concerning this list

2006-12-31 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, Thomas Ploch wrote: > Alright, my prof said '... to process documents written in structural > markup languages using regular expressions is a no-no.' (Because of > nested Elements? Can't remember) So I think he wants us to use regexes > to learn them. He is pointing to HTML

Re: Question concerning this list

2006-12-30 Thread Thomas Ploch
Steven D'Aprano wrote: > On Sun, 31 Dec 2006 02:03:34 +0100, Thomas Ploch wrote: > >> Hello fellow pythonists, >> >> I have a question concerning posting code on this list. >> >> I want to post source code of a module, which is a homework for >> university (yes yes, I know, please read on...). >

Re: Question concerning this list

2006-12-30 Thread Steven D'Aprano
On Sun, 31 Dec 2006 02:03:34 +0100, Thomas Ploch wrote: > Hello fellow pythonists, > > I have a question concerning posting code on this list. > > I want to post source code of a module, which is a homework for > university (yes yes, I know, please read on...). So long as you understand your un

Question concerning this list

2006-12-30 Thread Thomas Ploch
Hello fellow pythonists, I have a question concerning posting code on this list. I want to post source code of a module, which is a homework for university (yes yes, I know, please read on...). It is a web crawler (which I will *never* let out into the wide world) which uses regular expressions