On Mon, 29 Mar 2010 10:12:09 +0200, Stéphane Klein wrote:
> Hi,
>
> I am working on an HTML cleaner.
>
> I export OpenOffice.org documents to HTML. Next, I would like to clean
> these HTML export files:
>
> * remove comments
> * remove styles
> * remove dispensable tags
> * ...
>
> Some difficulties:
>
> * convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
> * convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>
>
> To do this processing, I use lxml and pyquery.
>
> Question:
>
> * Are there any XML helper tools in Python for this kind of processing?
>   I've searched PyPI and found nothing.
>
> If you confirm that such tools don't exist, I may publish a helper
> package to do this "clean" processing.
>
> Thanks for your help,
> Stephane

Take a look at htmllib and HTMLParser (two different modules) in the
Python standard library. In Python 3.x the parser module is called
html.parser.

You can use these to pick out specific tags from HTML documents. If you
need something more advanced, consider using an XML library instead.

-- 
Harishankar (http://harishankar.org http://literaryforums.org)
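
To make the html.parser suggestion concrete, here is a rough sketch for Python 3. It assumes that <span> and <font> should be unwrapped (tag removed, text kept), that <style> blocks and comments should be dropped entirely, and that attributes on the remaining tags can be discarded. The names TagStripper and clean are made up for this example; they are not an existing tool.

```python
from html import escape
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: re-emits HTML without the unwanted markup."""

    UNWRAP = {"span", "font"}  # assumption: drop the tag, keep its text
    DROP = {"style"}           # assumption: drop the tag and its content

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.drop_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP:
            self.drop_depth += 1
        elif tag not in self.UNWRAP and not self.drop_depth:
            self.out.append("<%s>" % tag)  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in self.DROP:
            self.drop_depth = max(0, self.drop_depth - 1)
        elif tag not in self.UNWRAP and not self.drop_depth:
            self.out.append("</%s>" % tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as written.
        if tag not in self.UNWRAP and tag not in self.DROP and not self.drop_depth:
            self.out.append("<%s/>" % tag)

    def handle_data(self, data):
        if not self.drop_depth:
            # Character references were decoded by the parser; re-escape them.
            self.out.append(escape(data, quote=False))

    # handle_comment is not overridden: the default does nothing,
    # so comments never reach the output.


def clean(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(clean("<p>my text <span>foo</span> bar</p>"))
# <p>my text foo bar</p>
print(clean("<h1><span><font>my title</font></span></h1>"))
# <h1>my title</h1>
```

This only handles the simple cases from the original question. Since lxml is already in use, working on the parsed tree there may be the better route for real OpenOffice.org exports.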