[EMAIL PROTECTED] wrote:
> 1- Find all html files in the folders (sub-folders ...)
> 2- Do some file I/O and feed Sed or Python or what else with the file.
> 3- Apply recursively some regular expression on the file to do the
>    things a want. (delete when it encounters certain tags, certain
>    attributes)
> 4- Write the changed file, and go through all the files like that.

Use the lxml.html.clean module, which is made exactly for that purpose. It's
not released yet, but you can use it from the current html branch of lxml.
There will soon be an official alpha of the 2.0 series, which will contain
lxml.html:

http://codespeak.net/svn/lxml/branch/html/

It looks like you're on Ubuntu, so compiling it from source after an SVN
checkout should be as simple as the usual setup.py dance.

Please report back to the lxml mailing list if you find any problems or have
any further ideas on how to make it even more versatile than it already is.

For lxml in general, see:

http://codespeak.net/lxml/

Stefan
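
For illustration, the four steps from the question could be sketched roughly like this with lxml.html.clean. This is only a minimal sketch, not a tested recipe for your files: the function name clean_tree, its parameters, the UTF-8 assumption and the example tag names are all made up for the example, and it overwrites files in place, so try it on a copy first.

```python
from pathlib import Path

# Note: in recent lxml releases the cleaner lives in the separate
# "lxml_html_clean" package, which must be installed for this import to work.
from lxml.html.clean import Cleaner


def clean_tree(root_dir, cleaner=None):
    """Clean every *.html file below root_dir in place.

    Hypothetical helper for illustration. Returns the list of files
    whose content was actually changed.
    """
    if cleaner is None:
        # Example settings only: drop <script> elements and event-handler
        # attributes, remove <font> tags but keep their text, and leave
        # <html>/<head>/<title> alone.
        cleaner = Cleaner(
            scripts=True,
            javascript=True,
            remove_tags=["font"],
            page_structure=False,
        )

    changed = []
    # Step 1: find all html files, including those in sub-folders.
    for path in sorted(Path(root_dir).rglob("*.html")):
        # Step 2: read the file (assumes UTF-8 encoded input).
        original = path.read_text(encoding="utf-8")
        if not original.strip():
            continue  # nothing to parse in an empty file

        # Step 3: let the cleaner remove the unwanted tags and attributes.
        cleaned = cleaner.clean_html(original)

        # Step 4: write the file back only if something changed.
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling clean_tree("/path/to/site") would then walk that folder with the example settings. Which tags and attributes to drop is controlled by the Cleaner options (kill_tags removes an element together with its content, remove_tags keeps the content), so no regular expressions are needed.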