Hi, I work at a company that is rebuilding its website: http://caslt.org/. The new website will be built by an external firm (I could do it myself, but since I'm just the summer student worker...). Anyway, to help them, they first asked me to copy all the text from every page of the site (and there is a lot!) into Word documents. I found the idea pretty stupid: the styling would have to be applied from scratch anyway, since we want to keep neither the old HTML code nor Microsoft Word's BS markup.
I proposed instead to take each page and make a copy containing only the text, with class names on the textual elements (h1, h2, p, strong, em ...), and then to define a CSS file that gives them some style. Now, we have around 1,600 documents to work on, and I thought I could challenge myself a bit and automate all the dull work. My idea is to parse all those pages with Python, rip out the navigation bars, keep just the text and the layout tags, and then apply class names to specific tags. The program would also have to remove the layout table that the text sits in. Another difficulty is that I want to keep the tables that are actually used for tabular data and not for positioning. So, I'm writing to get your opinion on which tools and which technique I should use to do this.

-- http://mail.python.org/mailman/listinfo/python-list
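
P.S. To make the idea more concrete, here is a rough sketch of what I have in mind, using only the standard library's html.parser. Everything specific in it is my own guess and not taken from the real site: the list of tags to keep, the "content-" class prefix, the rule that navigation lives in elements whose id or class contains "nav", and the heuristic that a table holding a <th> is a data table while any other table is layout. It also assumes reasonably well-nested markup, which our old pages may not have, so a more forgiving parser such as BeautifulSoup might turn out to be the better tool.

```python
from html import escape
from html.parser import HTMLParser

# All of these are illustrative assumptions, not facts about the real site.
TEXT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "strong", "em", "ul", "ol", "li"}
TABLE_TAGS = {"table", "tr", "th", "td"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag
SKIP_TAGS = {"head", "script", "style"}                   # dropped with their content


class TableClassifier(HTMLParser):
    """First pass: number tables in document order and note which hold a <th>."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.stack = []
        self.data_tables = set()

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.stack.append(self.count)
            self.count += 1
        elif tag == "th" and self.stack:
            self.data_tables.add(self.stack[-1])  # heuristic: <th> means data

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            self.stack.pop()


class Cleaner(HTMLParser):
    """Second pass: emit text tags with class names, unwrap layout tables."""

    def __init__(self, data_tables):
        super().__init__()
        self.data_tables = data_tables
        self.count = 0
        self.table_stack = []  # one bool per open table: True if it is kept
        self.skip_depth = 0    # > 0 while inside navigation, head, script...
        self.out = []

    def _keeps(self, tag):
        in_data_table = bool(self.table_stack) and self.table_stack[-1]
        return tag in TEXT_TAGS or (tag in TABLE_TAGS and in_data_table)

    def handle_starttag(self, tag, attrs):
        is_data = False
        if tag == "table":
            is_data = self.count in self.data_tables
            self.count += 1  # stay in step with the first pass numbering
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        attrs = dict(attrs)
        marker = (attrs.get("id") or "") + " " + (attrs.get("class") or "")
        if tag in SKIP_TAGS or "nav" in marker.lower():
            if tag not in VOID_TAGS:
                self.skip_depth = 1
            return
        if tag == "table":
            self.table_stack.append(is_data)
        if self._keeps(tag):
            # Original attributes are dropped on purpose; only the class stays.
            self.out.append(f'<{tag} class="content-{tag}">')

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        if self._keeps(tag):
            self.out.append(f"</{tag}>")
        if tag == "table" and self.table_stack:
            self.table_stack.pop()

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_page(html_text):
    """Return the page reduced to classed text tags and data tables."""
    classifier = TableClassifier()
    classifier.feed(html_text)
    classifier.close()
    cleaner = Cleaner(classifier.data_tables)
    cleaner.feed(html_text)
    cleaner.close()
    return "".join(cleaner.out)


if __name__ == "__main__":
    sample = (
        '<table><tr><td id="navbar"><a href="/">Home</a></td>'
        "<td><h1>News</h1><p>Fees &amp; dates</p>"
        "<table><tr><th>Year</th></tr><tr><td>2006</td></tr></table>"
        "</td></tr></table>"
    )
    print(clean_page(sample))
```

The two passes are there because a streaming parser cannot know, when it meets <table>, whether a <th> will show up later inside it. Running this over the 1,600 files would then just be a loop that reads each file, calls clean_page and writes the result somewhere else.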