Re: urllib2.urlopen(url) pulling something other than HTML

2007-08-21 Thread Stefan Behnel
Gabriel Genellina wrote:
> On 21 Aug, 18:36, [EMAIL PROTECTED] (John J. Lee) wrote:
>> Gabriel Genellina <[EMAIL PROTECTED]> writes:
>>
>> [...]
>>> Don't even try to understand it - it's a mess. Use the HTMLParser
>>> module instead.
>> [...]
>>
>> Module sgmllib (and therefore module htmllib also) i

Re: urllib2.urlopen(url) pulling something other than HTML

2007-08-21 Thread Gabriel Genellina
On 21 Aug, 18:36, [EMAIL PROTECTED] (John J. Lee) wrote:
> Gabriel Genellina <[EMAIL PROTECTED]> writes:
>
> [...]
> > Don't even try to understand it - it's a mess. Use the HTMLParser
> > module instead.
> [...]
>
> Module sgmllib (and therefore module htmllib also) is more tolerant of
> bad HTML t

Re: urllib2.urlopen(url) pulling something other than HTML

2007-08-21 Thread John J. Lee
Gabriel Genellina <[EMAIL PROTECTED]> writes:
[...]
> Don't even try to understand it - it's a mess. Use the HTMLParser
> module instead.
[...]

Module sgmllib (and therefore module htmllib also) is more tolerant of bad HTML than module HTMLParser.

John
-- http://mail.python.org/mailman/listinfo

Re: urllib2.urlopen(url) pulling something other than HTML

2007-08-20 Thread Stefan Behnel
[EMAIL PROTECTED] wrote:
> I personally think the application itself "feels" more complicated
> than it needs to be but it's possible that is just my inexperience. I'm
> going to do some reading about the HTMLParser module. I'm sure I
> could make this spider a bit more functional in the process.

Re: urllib2.urlopen(url) pulling something other than HTML

2007-08-20 Thread [EMAIL PROTECTED]
Those responses were both very helpful. John's additional type checking is straightforward and easy to implement. I will also rewrite the application a second time using the class Gabriel offered. Both of these suggestions will help me gain some insight into how Python works. "Don't even try to
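
The "additional type checking" mentioned above is not shown in this preview, so the following is only a guess at its general shape, written for Python 3 (urllib.request rather than urllib2). It checks the Content-Type header before handing the body to an HTML parser, which is one way to avoid feeding images or other non-HTML data to the crawler. The names is_html and fetch_html are hypothetical, and the accepted media types are an assumption.

```python
# Sketch only: not John's actual code. is_html and fetch_html are
# hypothetical helper names; Python 3's urllib.request stands in for urllib2.
from urllib.request import urlopen


def is_html(content_type):
    """Return True if a Content-Type header value names an HTML media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def fetch_html(url):
    """Return the decoded body of url, or None if the response is not HTML."""
    with urlopen(url) as response:
        if not is_html(response.headers.get("Content-Type")):
            return None
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A crawler loop could then skip any URL for which fetch_html returns None instead of passing the data to the parser.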

Re: urllib2.urlopen(url) pulling something other than HTML

2007-08-20 Thread Gabriel Genellina
On 20 Aug, 15:44, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
> --
> f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
> parser = htmllib.HTMLParser(f)
> parser.feed(html)
> parser.close()
> return parser.anchor
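
The class Gabriel offered as a replacement for the quoted htmllib code is cut off in this preview. As a rough illustration of the HTMLParser-based approach recommended in the thread, here is a minimal sketch for Python 3, where the module lives at html.parser and htmllib no longer exists. The names AnchorCollector and extract_links are hypothetical; the anchorlist attribute is kept only to mirror the quoted code.

```python
# Sketch only: not the class posted in the thread. AnchorCollector and
# extract_links are hypothetical names; html.parser is the Python 3 home
# of the HTMLParser module.
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)


def extract_links(html):
    """Return the list of href values found in an HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchorlist
```

Unlike the htmllib version, this needs no formatter or writer object, since nothing is rendered; the parser only records link targets.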

Re: urllib2.urlopen(url) pulling something other than HTML

2007-08-20 Thread John J. Lee
"[EMAIL PROTECTED]" <[EMAIL PROTECTED]> writes:
[...]
> --
> f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
> parser = htmllib.HTMLParser(f)
> parser.feed(html)
> parser.close()
> return parser.anchorlist
> -

urllib2.urlopen(url) pulling something other than HTML

2007-08-20 Thread [EMAIL PROTECTED]
I am reading "Python for Dummies" and found the following example of a web crawler that I thought was interesting. The first time I keyed in the program and ran it, I didn't understand it well enough to debug it, so I just skipped it. A few days later I realized that it failed after a few second