Richie Hindle wrote:
> But Tidy fails on huge numbers of real-world HTML pages.  Simple things like
> misspelled tags make it fail:
> >>> from mx.Tidy import tidy
> >>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")

[Various error messages]

> Is there a Python HTML tidier which will do as good a job as a browser?

As pointed out elsewhere, libxml2 will attempt to parse HTML if asked to:

>>> import libxml2dom
>>> d = libxml2dom.parseString("<html><body><pree>Hello world!</pre></body></html>", html=1)
>>> print d.toString()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
<html><body><pree>Hello world!</pree></body></html>

See how it fixes up the mismatching tags. The libxml2dom package is
available in the usual place:



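For anyone who wants to see roughly what "fixing up the mismatching tags" involves without installing libxml2, here is a small sketch using only the Python 3 standard library. It is not how libxml2 works internally, and it is nowhere near as forgiving as a browser: it only keeps a stack of open elements, drops end tags that never had a matching start tag, and closes whatever is still open. Comments and doctypes are simply discarded. The names BalancingParser and balance are made up for this illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class BalancingParser(HTMLParser):
    """Re-emit markup with every opened element closed (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the rebuilt document
        self.stack = []  # names of elements that are still open

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close everything opened since the matching start tag.
            while True:
                name = self.stack.pop()
                self.out.append("</%s>" % name)
                if name == tag:
                    break
        # An end tag with no matching start tag (e.g. </pre>) is dropped.

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def close(self):
        super().close()
        # Close anything the input left open.
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())


def balance(markup):
    parser = BalancingParser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(balance("<html><body><pree>Hello world!</pre></body></html>"))
# prints: <html><body><pree>Hello world!</pree></body></html>
```

On the example from the original question this gives the same element structure as the libxml2dom session above, but for real-world pages a proper HTML parser such as libxml2 remains the better choice.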