[EMAIL PROTECTED] wrote:
> I'm trying to extract some data from an XHTML Transitional web page.
>
> What is best way to do this?

An XML parser should be sufficient. However...

> xml.dom.minidom.parseString("text of web page") gives errors about it
> not being well formed XML.
>
> Do I just need to add something like <?xml ...?> or what?

If the page isn't well-formed then it isn't proper XHTML, since the XHTML
specification [1] says...

  4.1. Documents must be well-formed

Yes, it's a heading, albeit in an "informative" section describing how
XHTML differs from HTML 4. See "3.2. User Agent Conformance" for a
"normative" mention of well-formedness.

You could try libxml2dom (or other libxml2-based solutions) for some
fairly effective HTML parsing:

  libxml2dom.parseString("text of document here", html=1)

See http://www.python.org/pypi/libxml2dom for more details.

Paul

[1] http://www.w3.org/TR/xhtml1/
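A minimal sketch of the well-formedness point above, using only the standard library's xml.dom.minidom. The two sample strings and the try_parse helper are made up for illustration and are not from the original question; a real page will likely fail in other ways too.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample documents, for illustration only.
WELL_FORMED = "<html><body><p>Hello<br/>world</p></body></html>"
NOT_WELL_FORMED = "<html><body><p>Hello<br>world</p></body></html>"  # <br> is never closed


def try_parse(text):
    """Return a DOM document, or None if the text is not well-formed XML."""
    try:
        return xml.dom.minidom.parseString(text)
    except ExpatError:
        # minidom uses expat, which rejects anything that is not well-formed.
        return None


doc = try_parse(WELL_FORMED)
paragraphs = doc.getElementsByTagName("p")
first_text = paragraphs[0].firstChild.data  # text node before the <br/>

broken = try_parse(NOT_WELL_FORMED)  # None: the unclosed <br> is a parse error
```

If try_parse returns None for a page, adding an <?xml ...?> declaration will not help; the markup itself has to be repaired, or handed to a more lenient HTML parser such as the libxml2-based one mentioned above.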