Re: urllib2.urlopen(url) pulling something other than HTML

Stefan Behnel Tue, 21 Aug 2007 23:51:07 -0700

Gabriel Genellina wrote:
> On 21 ago, 18:36, [EMAIL PROTECTED] (John J. Lee) wrote:
>> Gabriel Genellina <[EMAIL PROTECTED]> writes:
>>
>> [...]
>>> Don't even try to understand it - it's a mess. Use the HTMLParser
>>> module instead.
>> [...]
>>
>> Module sgmllib (and therefore module htmllib also) is more tolerant of
>> bad HTML than module HTMLParser.
> 
> I had the impression it was the opposite; anyway, neither of them can
> handle really bad html.
> I just don't *like* htmllib.HTMLParser - but that's only a matter of
> taste.


lxml.html handles bad HTML and it's a powerful tool that is very easy to use.
And if one day you have to deal with really, *really* broken tag soup, it also
comes with BeautifulSoup parser integration.
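
To illustrate the point about bad HTML, here is a minimal sketch, assuming
lxml is installed and a current Python 3 interpreter. The markup string and
the variable names are made up for the example; only lxml.html.fromstring(),
iter(), get() and text_content() come from lxml itself.

```python
import lxml.html

# Deliberately malformed markup (hypothetical input): the <p>, <b> and <a>
# tags are never closed properly and there is no <html> or <body> wrapper.
broken = "<p>First paragraph<p>Second <b>bold <a href='/next'>link</p>"

# fromstring() lets the parser repair the markup and returns an element tree.
root = lxml.html.fromstring(broken)

# Ordinary tree queries work on the repaired document.
paragraphs = [p.text_content() for p in root.iter("p")]
links = [a.get("href") for a in root.iter("a")]

print(paragraphs)
print(links)
```

The same calls work on a page fetched over HTTP: pass the response body to
lxml.html.fromstring() instead of the literal string.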

Stefan