Richie Hindle wrote:
>
> But Tidy fails on huge numbers of real-world HTML pages. Simple things like
> misspelled tags make it fail:
>
> >>> from mx.Tidy import tidy
> >>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")
[Various error messages]
> Is there a Python HTML tidier which will do as good a job as a browser?

As pointed out elsewhere, libxml2 will attempt to parse HTML if asked to:

>>> import libxml2dom
>>> d = libxml2dom.parseString("<html><body><pree>Hello world!</pre></body></html>", html=1)
>>> print d.toString()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><pree>Hello world!</pree></body></html>

See how it fixes up the mismatching tags. The libxml2dom package is
available in the usual place:

http://www.python.org/pypi/libxml2dom

Paul
--
http://mail.python.org/mailman/listinfo/python-list