Re: Parsing HTML/XML documents

Stefan Behnel Thu, 26 Apr 2007 07:14:10 -0700

[EMAIL PROTECTED] wrote:
> I need to parse real-world HTML/XML documents and I found two nice Python
> solutions: BeautifulSoup and Tidy.


There's also lxml, in case you want a real XML tool.
http://codespeak.net/lxml/
http://codespeak.net/lxml/dev/parsing.html#parsers
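
A minimal sketch of what parsing broken HTML with lxml might look like,
assuming lxml is installed. The sample markup and the helper name
parse_broken_html are made up for illustration; they do not come from the
lxml documentation.

```python
from lxml import etree

# Made-up sample input: unclosed <p> and <b> tags, no closing </html>.
BROKEN_HTML = "<html><body><p>First paragraph<p>Second <b>bold</p></body>"


def parse_broken_html(text):
    """Parse possibly malformed HTML and return the root element."""
    # HTMLParser is lxml's lenient parser; it tries to recover from
    # broken markup instead of raising an error as the XML parser would.
    parser = etree.HTMLParser()
    return etree.fromstring(text, parser)


root = parse_broken_html(BROKEN_HTML)

# Walk the repaired tree and print each paragraph as serialized HTML.
for p in root.iter("p"):
    print(etree.tostring(p, method="html", encoding="unicode"))
```

How well the recovery works still depends on how the input is broken, so it
is worth trying it on a sample of your own documents.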


> However, I found pyXPCOM, which is a wrapper for Gecko. So I was thinking
> Gecko surely handles bad HTML in a more consistent and error-proof way
> than BS and Tidy.
> 
> I'm interested in using the Mozilla DOM from inside a Python script, but
> I'm a bit confused about how I can use pyXPCOM to accomplish this.

I've never used it, but I doubt Gecko would yield substantially better results
than any of the three above. You're dealing with broken data here, so which
one of them wins simply depends on your input.

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list
