On Jul 13, 7:07 pm, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
> On Jul 13, 1:57 pm, [EMAIL PROTECTED] wrote:
> > Hi,
> >
> > I'm in the process of refactoring a lot of HTML documents and I'm
> > using HTML Tidy to do part of this work (clean up, convert to XHTML,
> > and remove font and center tags).
> >
> > Now, Tidy will only do part of the work I need to do. I have to
> > remove all the presentational tags and attributes from the pages (in
> > other words, strip the pages down), including the tables that are
> > used for laying out content (how do I tell those apart?).
> >
> > I thought about doing that with Python (which I'm in the process of
> > learning), but maybe another tool (like sed?) would be better suited
> > for this job.
> >
> > I know roughly what I need to do:
> >
> > 1- Find all HTML files in the folders (sub-folders ...)
> > 2- Do some file I/O and feed sed or Python or whatever else with the
> >    file.
> > 3- Repeatedly apply some regular expressions to the file to do the
> >    things I want (delete certain tags and certain attributes when it
> >    encounters them).
> > 4- Write the changed file, and go through all the files like that.
> >
> > But I don't know how to do it for real, the syntax and everything. I
> > also want to pick the tool that's the easiest for this job. I heard
> > about BeautifulSoup and lxml for Python, but I don't know if those
> > modules would help.
> >
> > Now, I know I'm not at the best place to ask whether Python is the
> > right choice (anyway, even my little finger tells me it is), but if I
> > can do the same thing more simply with another tool it would be good
> > to know.
> >
> > Another argument for the other tools is that I know how to use the
> > Unix find program to find the files and feed them to grep or sed, but
> > I still don't know the syntax in Python (fetch files, change them,
> > then write them), and I don't know if I should read the files and
> > treat them as a whole or just line by line. Of course I could mix
> > commands with some Python: pipe the find command into my program's
> > standard input, and my program's standard output back to the original
> > file. But how do I control STDIN and STDOUT with Python?
> >
> > Sorry if that's a lot of questions in one, and I will probably get a
> > lot of RTFM (which I'm doing, btw), but I feel a little lost in all
> > that right now.
> >
> > Any help would be really appreciated.
> >
> > Thanks
>
> You might find a text editor is the way to go. You can use AutoIt,
> either through Python or by itself, to control the text editor you
> use. I just downloaded PSPad and it looks like it will do that. It
> may be a pain to script, though.
>
> http://sourceforge.net/projects/dex-tracker/
Let me add to that: it may be a pain to script with AutoIt, and I am not giving more of an example because it won't insert a text file at a location like mdipad will.
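
For comparison, here is a rough sketch of the four steps from the original question (find the files, read them, strip tags and attributes, write them back) using only the Python standard library. It uses html.parser rather than regular expressions, since regexes tend to break on nested or oddly quoted markup. The names Stripper, strip_presentation, clean_tree, DROP_TAGS and DROP_ATTRS are made up for this example, the tag and attribute lists are guesses at what counts as "presentational", and it assumes the files are UTF-8. It does not try to tell layout tables from data tables, and it overwrites files in place, so try it on a copy first.

```python
import os
from html import escape
from html.parser import HTMLParser

# Hypothetical lists for illustration; adjust them to the pages at hand.
DROP_TAGS = {"font", "center"}
DROP_ATTRS = {"align", "bgcolor", "border", "color", "face",
              "height", "size", "style", "valign", "width"}


class Stripper(HTMLParser):
    """Re-emit a document, skipping presentational tags and attributes."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _open(self, tag, attrs, close=""):
        if tag in DROP_TAGS:
            return  # drop the tag itself, keep whatever it wraps
        kept = "".join(
            f" {k}" if v is None else f' {k}="{escape(v)}"'
            for k, v in attrs if k not in DROP_ATTRS
        )
        self.out.append(f"<{tag}{kept}{close}>")

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, " /")

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def strip_presentation(html_text):
    """Treat the document as a whole string, not line by line."""
    parser = Stripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def clean_tree(root):
    """Steps 1, 2 and 4: walk the folders and rewrite each HTML file."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith((".html", ".htm")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as f:  # assumes UTF-8
                    text = f.read()
                with open(path, "w", encoding="utf-8") as f:
                    f.write(strip_presentation(text))
```

Calling clean_tree("some/folder") would process every .html and .htm file below that folder. On the STDIN/STDOUT question: the same function works as a filter if you import sys and call sys.stdout.write(strip_presentation(sys.stdin.read())), which lets you keep using find and shell redirection around it.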