On Jul 13, 1:57 pm, [EMAIL PROTECTED] wrote:
> Hi,
>
> I'm in the process of refactoring a lot of HTML documents and I'm
> using HTML Tidy to do part of this work (clean up, convert to XHTML,
> and remove font and center tags).
>
> Tidy will only do part of the work I need to do. I have to remove
> all the presentational tags and attributes from the pages (in other
> words, strip the pages down), including the tables that are used for
> layout of content (how to differentiate?).
>
> I thought about doing that with Python (which I'm in the process of
> learning), but maybe another tool (like sed?) would be better suited
> for this job.
>
> I know roughly what I need to do:
>
> 1- Find all HTML files in the folders (sub-folders ...)
> 2- Do some file I/O and feed sed or Python or whatever else with the
>    file.
> 3- Repeatedly apply some regular expressions to the file to do the
>    things I want (delete certain tags and certain attributes when it
>    encounters them).
> 4- Write the changed file, and go through all the files like that.
>
> But I don't know how to do it for real, the syntax and everything. I
> also want to pick the tool that's the easiest for this job. I heard
> about BeautifulSoup and lxml for Python, but I don't know if those
> modules would help.
>
> Now, I know I'm not at the best place to ask if Python is the right
> choice (anyway, even my little finger tells me it is), but if I can
> do the same thing more simply with another tool it would be good to
> know.
>
> Another argument for the other tools is that I know how to use the
> Unix find program to find the files and feed them to grep or sed, but
> I still don't know the syntax with Python (fetch files, change them,
> then write them), and I don't know if I should read the files and
> treat them as a whole or just line by line. Of course I could mix
> commands with some Python: the find command to my program's standard
> input, and my program's standard output to the original file. But how
> do I control STDIN and STDOUT with Python?
>
> Sorry if that's a lot of questions in one, and I will probably get a
> lot of RTFM (which I'm doing, btw), but I feel a little lost in all
> that right now.
>
> Any help would be really appreciated.
> Thanks
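For the Python route, steps 1 to 4 above could look roughly like the sketch below. It uses only the standard library (os.walk to find the files, html.parser to rewrite them) instead of regular expressions. The tag and attribute lists, the function names, and the UTF-8 encoding are my assumptions, not anything taken from your pages. It reads each file whole rather than line by line, and it makes no attempt to tell layout tables from data tables. BeautifulSoup or lxml would probably cope better with badly broken markup.

```python
import os
import sys
from html import escape
from html.parser import HTMLParser

# Assumed lists: adjust them to the markup actually found in the pages.
DROP_TAGS = {"font", "center"}
DROP_ATTRS = {"align", "bgcolor", "border", "color", "face",
              "height", "size", "style", "valign", "width"}


class Stripper(HTMLParser):
    """Re-emit a document, omitting presentational tags and attributes."""

    def __init__(self):
        # Keep entities such as &amp; exactly as they were written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _start(self, tag, attrs, close):
        if tag in DROP_TAGS:
            return  # drop the tag itself but keep whatever it encloses
        kept = "".join(
            ' %s="%s"' % (name,
                          escape(name if value is None else value, quote=True))
            for name, value in attrs
            if name not in DROP_ATTRS
        )
        self.out.append("<%s%s%s>" % (tag, kept, close))

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs, "")

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, " /")

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def strip_presentation(html_text):
    """Return html_text without the tags and attributes listed above."""
    parser = Stripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def clean_tree(root):
    """Rewrite every .html/.htm file under root, in place."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith((".html", ".htm")):
                path = os.path.join(dirpath, filename)
                with open(path, encoding="utf-8") as handle:
                    text = handle.read()  # whole file, not line by line
                with open(path, "w", encoding="utf-8") as handle:
                    handle.write(strip_presentation(text))


def filter_stream(src=sys.stdin, dst=sys.stdout):
    """Read one document from src and write the cleaned version to dst."""
    dst.write(strip_presentation(src.read()))
```

Calling clean_tree("site") would rewrite every HTML file under a (hypothetical) site directory in place, so try it on a copy first. As for STDIN and STDOUT: sys.stdin and sys.stdout are ordinary file objects, so a script that just calls filter_stream() can sit in a shell pipeline the same way sed does.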
You might find a text editor is the way to go. You can use AutoIt, either from Python or by itself, to control the text editor you use. I just downloaded PSPad and it looks like it will do that, though it may be a pain to script.

http://sourceforge.net/projects/dex-tracker/