[EMAIL PROTECTED] wrote:
> I work at this company and we are re-building our website:
> http://caslt.org/. The new website will be built by an external firm
> (I could do it myself, but since I'm just the summer student
> worker...). Anyway, to help them, they first asked me to copy all the
> text from all the pages of the site (and there is a lot!) into Word
> documents. I found the idea pretty stupid, since the styling would
> have to be applied from scratch anyway: we don't want to keep either
> the old HTML code or Microsoft Word's BS code.
>
> I proposed to take each page and make a copy with only the text, with
> class names on the textual elements (h1, h2, p, strong, em, ...), and
> then define a CSS file giving them some style.
>
> Now, we have around 1 600 documents to work on, and I thought I could
> challenge myself a bit and automate all the dull work. I thought
> about parsing all those pages with Python, ripping out the navigation
> bars and keeping only the text and the layout tags, and then applying
> class names to specific tags. The program would also have to remove
> the table the text sits in. Another difficulty is that I want to keep
> the tables that are actually used for tabular data, not for
> positioning.
>
> So, I'm writing to get your opinion on which tools and techniques I
> should use for this.
lxml is what you're looking for, especially if you're familiar with XPath.

http://codespeak.net/lxml/dev

Stefan
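Untested, but something along these lines should get you started. It's a
minimal sketch assuming the navigation bars can be matched by an XPath
expression; the id/class selectors, file paths and class names below are
placeholders you'd have to adapt to the site's real markup. The heuristic
for telling data tables from layout tables (a table with no <th> cells is
treated as layout) is just a guess, too.

    import os
    import lxml.html

    # Mapping from textual tags to the CSS class names to attach
    # (hypothetical names -- use whatever your stylesheet defines).
    CLASS_NAMES = {
        'h1': 'title', 'h2': 'subtitle', 'p': 'body',
        'strong': 'important', 'em': 'emphasis',
    }

    def clean_page(src, dest):
        tree = lxml.html.parse(src)
        root = tree.getroot()

        # Rip out the navigation bars.  The id/class values here are
        # made up -- inspect the real pages for the right hooks.
        for nav in root.xpath('//*[@id="navbar" or @class="menu"]'):
            nav.drop_tree()      # removes the element and its children

        # Tables without <th> cells are assumed to be positioning
        # tables: flatten them but keep their contents.  Real data
        # tables (with <th> cells) survive untouched.
        for table in root.xpath('//table[not(.//th)]'):
            for el in table.xpath('.//tr | .//td'):
                el.drop_tag()    # removes the tag, keeps its children
            table.drop_tag()

        # Attach class names to the textual elements.
        for tag, cls in CLASS_NAMES.items():
            for el in root.xpath('//' + tag):
                el.set('class', cls)

        with open(dest, 'wb') as f:
            f.write(lxml.html.tostring(root, pretty_print=True))

    # Run it over the whole site tree, e.g.:
    for dirpath, dirnames, filenames in os.walk('site'):
        for name in filenames:
            if name.endswith('.html'):
                path = os.path.join(dirpath, name)
                clean_page(path, path + '.clean')

drop_tree() and drop_tag() are lxml.html conveniences that make this kind
of surgery short; see the lxml docs at the URL above for the details.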