Re: HTMLParser fragility

Paul Boddie Thu, 06 Apr 2006 11:20:43 -0700

Richie Hindle wrote:
>
> But Tidy fails on huge numbers of real-world HTML pages.  Simple things like
> misspelled tags make it fail:
>
> >>> from mx.Tidy import tidy
> >>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")


[Various error messages]

> Is there a Python HTML tidier which will do as good a job as a browser?

As pointed out elsewhere, libxml2 will attempt to parse HTML if asked
to:

>>> import libxml2dom
>>> d = libxml2dom.parseString("<html><body><pree>Hello world!</pre></body></html>", html=1)
>>> print d.toString()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><pree>Hello world!</pree></body></html>

See how it fixes up the mismatched tags: the stray </pre> is dropped
and the <pree> element gets a matching </pree>. The libxml2dom package
is available in the usual place:

http://www.python.org/pypi/libxml2dom
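
If libxml2 is not available, the same kind of repair can be roughly
approximated with the standard library alone. The following is only a
sketch under stated assumptions: it targets Python 3 (html.parser), the
names TagBalancer, VOID_ELEMENTS and balance are hypothetical and not
part of libxml2dom, and the stack-based strategy is a simple stand-in
rather than libxml2's actual recovery algorithm. It drops comments and
doctypes, re-escapes text, and knows nothing about HTML nesting rules.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small, incomplete set of elements that never take an end tag.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class TagBalancer(HTMLParser):
    """Hypothetical helper: re-emit markup so every opened element is closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of currently open element names
        self.output = []     # pieces of the repaired markup

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared in the input.
        self.output.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: nothing to track on the stack.
        self.output.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag (e.g. </pre> with no <pre>): drop it
        # Close everything opened since the matching start tag.
        while True:
            name = self.open_tags.pop()
            self.output.append("</%s>" % name)
            if name == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser, so escape again.
        self.output.append(escape(data, quote=False))


def balance(markup):
    """Return markup with stray end tags removed and open elements closed."""
    parser = TagBalancer()
    parser.feed(markup)
    parser.close()
    # Close whatever is still open at the end of the input.
    while parser.open_tags:
        parser.output.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.output)


print(balance("<html><body><pree>Hello world!</pre></body></html>"))
```

For the input used in this thread, the sketch prints the same body that
libxml2 produced, without the DOCTYPE line.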

Paul

-- 
http://mail.python.org/mailman/listinfo/python-list
