.doc to HTML and PDF conversion with Python
I need to take a bunch of .doc files (Word 2000), which contain a little text with some tables/layout and mostly pictures, convert them to PDF, and also extract the text and images separately.

If I have a PDF, I can create the HTML with pdftohtml called from Python with popen. However, I need an automated way to convert the .doc to PDF first. Is there a way to do what I want, either with a Python lib, a 3rd-party app, or maybe by remote-controlling Word (a la VBA) and "printing" to PDF with a distiller?

I already tried wvware from gnuwin32; however, it has problems with big image files embedded in the .doc file (looks like an mmap error).

Alex

--
http://mail.python.org/mailman/listinfo/python-list
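For reference, the pdftohtml step mentioned above could look roughly like the sketch below. It uses the subprocess module rather than popen directly, and it assumes a pdftohtml executable is on the PATH and accepts the input PDF followed by an output base name. The function names build_pdftohtml_command and pdf_to_html are made up for this illustration.

```python
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, out_base=None, exe="pdftohtml"):
    """Return the argument list for one pdftohtml run.

    Assumption: the tool is invoked as `pdftohtml <pdf> <output-base>`.
    If no output base is given, the PDF name without its suffix is used.
    """
    pdf_path = Path(pdf_path)
    if out_base is None:
        out_base = pdf_path.with_suffix("")
    return [exe, str(pdf_path), str(out_base)]


def pdf_to_html(pdf_path, out_base=None, exe="pdftohtml"):
    """Run pdftohtml on one file and return whatever it printed to stdout."""
    cmd = build_pdftohtml_command(pdf_path, out_base, exe)
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the tool's own message so a batch run can log the failure.
        raise RuntimeError(
            "pdftohtml failed for %s: %s" % (pdf_path, result.stderr.strip())
        )
    return result.stdout
```

Keeping the command construction separate from the call makes it easy to log or inspect what would be run for each file before processing a whole directory unattended.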
Re: .doc to HTML and PDF conversion with Python
> Is there some reason you really want to convert to PDF first? You can
> get much better HTML right from the Word doc. You'll lose a lot of info
> going from PDF to HTML.

Right now, two reasons:

First, printing to PDF lets me create the PDF "for the web", which means a much smaller file size for downloading. That is mainly achieved through automatic image resizing.

Second, HTML Tidy simply doesn't work on the HTML files I get from Word. I need it to work automagically, without intervention, and to be usable by people who are not really PC literate:

D:\tmp\pdftohtml>tidy auberer.htm -o aub2.htm
line 1 column 1 - Warning: missing declaration
line 172 column 70 - Error: is not recognized!
line 172 column 70 - Warning: discarding unexpected
line 172 column 75 - Warning: discarding unexpected
line 176 column 47 - Error: is not recognized!
line 176 column 47 - Warning: discarding unexpected
line 176 column 52 - Warning: discarding unexpected
line 179 column 77 - Error: is not recognized!
line 179 column 77 - Warning: discarding unexpected
line 179 column 82 - Warning: discarding unexpected
line 182 column 55 - Error: is not recognized!
line 182 column 55 - Warning: discarding unexpected
line 182 column 60 - Warning: discarding unexpected
line 185 column 57 - Error: is not recognized!
line 185 column 57 - Warning: discarding unexpected
line 185 column 62 - Warning: discarding unexpected
line 188 column 55 - Error: is not recognized!
94 warnings, 38 errors were found! Not all warnings/errors were shown.

This document has errors that must be fixed before using HTML Tidy to generate a tidied up version.

So far, pdftohtml has worked flawlessly and has created much saner HTML output out of the box than Word 2000.
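Since the whole thing has to run without intervention, a script could detect a failed Tidy run like the one above from the exit status instead of reading the log. The sketch below relies on HTML Tidy's documented exit codes (0 for success, 1 for warnings, 2 for errors) and assumes a tidy executable on the PATH. The names classify_tidy_exit and tidy_file are invented for this example.

```python
import subprocess

# HTML Tidy exit status: 0 = clean, 1 = warnings only, 2 = errors
# (with errors, no tidied output is produced unless Tidy is forced to).
TIDY_STATUS = {0: "ok", 1: "warnings", 2: "errors"}


def classify_tidy_exit(returncode):
    """Map a Tidy exit status to a short label."""
    return TIDY_STATUS.get(returncode, "unknown")


def tidy_file(src, dst, exe="tidy"):
    """Run Tidy on src, writing to dst; return (label, diagnostics).

    Tidy prints its warnings and errors on stderr, so that stream is
    captured and handed back for logging.
    """
    result = subprocess.run(
        [exe, "-o", dst, src], capture_output=True, text=True
    )
    return classify_tidy_exit(result.returncode), result.stderr
```

A batch job could then fall back to the PDF route for any file where the label comes back as "errors", without anyone having to look at the output.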