Hi,

I'm in the process of refactoring a lot of HTML documents, and I'm using HTML Tidy to do part of this work (clean up, convert to XHTML, and remove font and center tags).
Now, Tidy will only do part of the work I need. I have to remove all the presentational tags and attributes from the pages (in other words, strip the pages down), including the tables that are used for layout rather than for tabular data (how do I tell the two apart?). I thought about doing that with Python (which I'm in the process of learning), but maybe another tool (like sed?) would be better suited for this job.

I know roughly what I need to do:

1- Find all HTML files in the folders (sub-folders, ...).
2- Do some file I/O and feed sed or Python or whatever else with each file.
3- Repeatedly apply some regular expressions to the file to do the things I want (delete certain tags and certain attributes when it encounters them).
4- Write the changed file, and go through all the files like that.

But I don't know how to do it for real: the syntax and everything. I also want to pick the tool that's the easiest for this job. I heard about BeautifulSoup and lxml for Python, but I don't know if those modules would help.

Now, I know I'm not at the best place to ask whether Python is the right choice (anyway, even my little finger tells me it is), but if I can do the same thing more simply with another tool, it would be good to know. Another argument for the other tools is that I know how to use the Unix find program to find the files and feed them to grep or sed, but I still don't know the syntax for doing this with Python (fetch the files, change them, then write them), and I don't know if I should read each file and treat it as a whole or go line by line.

Of course I could mix shell commands with some Python: pipe the output of the find command to my program's standard input, and send my program's standard output back to the original file. But how do I control STDIN and STDOUT with Python?

Sorry if that's a lot of questions in one, and I will probably get a lot of RTFM (which I'm doing, btw), but I feel a little lost in all that right now. Any help would be really appreciated.

Thanks

--
http://mail.python.org/mailman/listinfo/python-list
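For illustration, the four steps above could be sketched in Python roughly as follows. This is only a minimal sketch under several assumptions, not a definitive solution: it uses the standard library's html.parser instead of regular expressions, it reads each file as a whole, it assumes the files are UTF-8, and the contents of DROP_TAGS and DROP_ATTRS as well as the function names (strip_presentation, clean_tree, filter_stream) are made up for the example. It does not try to detect layout tables, and marked sections such as CDATA may not survive the round trip.

```python
import os
import sys
from html import escape
from html.parser import HTMLParser

# Assumed lists, for illustration only; adjust them to the pages at hand.
DROP_TAGS = {"font", "center"}
DROP_ATTRS = {"align", "bgcolor", "border", "color", "face", "size"}


class Stripper(HTMLParser):
    """Re-emit the document, skipping unwanted tags and attributes."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_start(self, tag, attrs, closing):
        if tag in DROP_TAGS:
            return  # drop the tag itself but keep whatever it contains
        parts = [tag]
        for name, value in attrs:
            if name in DROP_ATTRS:
                continue
            if value is None:
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (" ".join(parts), closing))

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, "")

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, " /")

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def strip_presentation(text):
    """Step 3: return text without the presentational markup."""
    parser = Stripper()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


def clean_tree(root):
    """Steps 1, 2 and 4: rewrite the HTML files under root in place."""
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            # newline="" leaves the original line endings untouched.
            with open(path, encoding="utf-8", newline="") as f:
                original = f.read()
            cleaned = strip_presentation(original)
            if cleaned != original:
                with open(path, "w", encoding="utf-8", newline="") as f:
                    f.write(cleaned)
                changed.append(path)
    return changed


def filter_stream(stdin=None, stdout=None):
    """Filter variant: read one document from STDIN, write it to STDOUT."""
    if stdin is None:
        stdin = sys.stdin
    if stdout is None:
        stdout = sys.stdout
    stdout.write(strip_presentation(stdin.read()))
```

Calling clean_tree("some/folder") would rewrite the files in place and return the list of changed paths, so working on a copy of the documents seems prudent. filter_stream() shows the STDIN/STDOUT side: sys.stdin and sys.stdout are ordinary file objects, so a script that calls it can sit in a shell pipeline next to find.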