Stéphane Klein, 29.03.2010 10:12:
I am working on an HTML cleaner.
I export OpenOffice.org documents to HTML.
Next, I would like to clean these exported HTML files:
* remove comments
* remove styles
* remove dispensable tags
* ...
Some difficulties:
* convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
* convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>
To do this, I use lxml and pyquery.
lxml.html has tools for that in the 'clean' module. Just specify the list
of tags that you want to discard.
* Are there any XML helper tools in Python for this kind of processing? I
looked on PyPI and found nothing.
The HTML tools in the standard library are close to non-existent. You can
achieve some things with the builtin tools, but if they fail for a
particular input document, there's little you can do.
If you confirm that these tools don't exist, I may publish a helper
package to do this "clean" processing.
Take a look at lxml.html.clean first.
Stefan
--
http://mail.python.org/mailman/listinfo/python-list