Re: HTMLParser fragility

Walter Dörwald Thu, 06 Apr 2006 06:50:53 -0700

Rene Pijlman wrote:
> Lawrence D'Oliveiro:
>> I've been using HTMLParser to scrape Web sites. The trouble with this 
>> is, there's a lot of malformed HTML out there. Real browsers have to be 
>> written to cope gracefully with this, but HTMLParser does not. 
> 
> There are two solutions to this:
> 
> 1. Tidy the source before parsing it.
> http://www.egenix.com/files/python/mxTidy.html
> 
> 2. Use something more forgiving, like BeautifulSoup.
> http://www.crummy.com/software/BeautifulSoup/


You can also use the HTML parser from libxml2 or any of the available
wrappers for it.
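
A minimal sketch of what that could look like, assuming the lxml wrapper
(one of several libxml2 bindings, not one named in this thread) is
installed. The sample markup and the extract_links helper are made up for
illustration. If lxml cannot be imported, the sketch falls back to the
standard library's html.parser, which in current Python 3 releases is much
more tolerant than the HTMLParser discussed above.

```python
from html.parser import HTMLParser

try:
    import lxml.html  # third-party libxml2 wrapper; assumed to be installed
except ImportError:
    lxml = None  # fall back to the standard library parser below

# Hypothetical malformed input: unclosed <p>, unquoted attribute value,
# an <a> left open inside <b>, and no closing </body> or </html>.
MALFORMED = "<html><body><p>First<p>Second <a href=/a>one</a> <b><a href='/b'>two</b>"


class _LinkCollector(HTMLParser):
    """Standard-library fallback: record the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(markup):
    """Return the href values of all <a> elements, in document order."""
    if lxml is not None:
        # libxml2's HTML parser recovers from the broken markup and
        # hands back a usable element tree.
        root = lxml.html.fromstring(markup)
        return [a.get("href") for a in root.iter("a") if a.get("href") is not None]
    collector = _LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(extract_links(MALFORMED))
```

With the lxml path you get a full element tree back, so XPath queries and
tree traversal also work on the repaired document. The fallback only sees a
stream of tag events.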

Bye,
   Walter Dörwald

-- 
http://mail.python.org/mailman/listinfo/python-list
