.doc to html and pdf conversion with python

2006-10-14 Thread Alexander Klingenstein
I need to take a bunch of .doc files (Word 2000), which contain a little text
including some tables/layout and mostly pictures, convert them to PDF, and
extract the text and images separately too. Once I have a PDF, I can create
the HTML with pdftohtml called from Python with popen. However, I need an
automated way to convert the .doc to PDF first.
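
A minimal sketch of the PDF-to-HTML step described above, assuming a
modern Python 3 and a pdftohtml executable on the PATH. The function names
and the output directory layout are hypothetical; -noframes is the
pdftohtml switch that produces a single HTML file instead of a frameset.

```python
import subprocess
from pathlib import Path


def build_pdftohtml_cmd(pdf_path, html_path, exe="pdftohtml"):
    """Return the argument list for one pdftohtml run.

    -noframes asks pdftohtml for a single HTML file instead of a frameset.
    """
    return [exe, "-noframes", str(pdf_path), str(html_path)]


def pdf_to_html(pdf_path, out_dir):
    """Convert one PDF to HTML; raises CalledProcessError on failure."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    html_path = out_dir / (Path(pdf_path).stem + ".html")
    # check=True turns a non-zero exit status into an exception,
    # so a batch run does not silently skip broken files.
    subprocess.run(build_pdftohtml_cmd(pdf_path, html_path), check=True)
    return html_path
```

Passing the command as a list rather than a shell string avoids quoting
problems with file names that contain spaces.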

Is there a way to do what I want with a Python lib, a 3rd party app, or
maybe by remote controlling Word (a la VBA) and "printing" to PDF with a distiller?
I already tried wvWare from GnuWin32, but it has problems with big image
files embedded in the .doc file (looks like an mmap error).
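
A rough sketch of the "remote controlling Word" idea, not a tested
solution. It assumes Windows, an installed copy of Word, the pywin32
package, and a PostScript printer driver whose name is passed in as
printer_name (the name depends on the machine). The helper names are
hypothetical. Word prints to a .ps file, which a distiller would then
have to turn into a PDF.

```python
import os


def ps_path_for(doc_path):
    """Return the PostScript path next to the .doc (same name, .ps suffix)."""
    return os.path.splitext(os.path.abspath(doc_path))[0] + ".ps"


def print_doc_to_postscript(doc_path, printer_name):
    """Print one .doc to a PostScript file through Word's COM interface.

    printer_name is an assumption: it must name an installed PostScript
    printer driver (for example the one a distiller package sets up).
    """
    # Imported here so the module still loads where pywin32 is missing.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        word.ActivePrinter = printer_name
        doc = word.Documents.Open(os.path.abspath(doc_path))
        try:
            # Background=False makes PrintOut wait until spooling is done.
            doc.PrintOut(Background=False,
                         OutputFileName=ps_path_for(doc_path),
                         PrintToFile=True)
        finally:
            doc.Close(SaveChanges=0)  # 0 = wdDoNotSaveChanges
    finally:
        word.Quit()
    return ps_path_for(doc_path)
```

The resulting .ps file still needs a distiller step (Acrobat Distiller or
Ghostscript's ps2pdf, for instance) before pdftohtml can be run on it.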

Alex


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: .doc to html and pdf conversion with python

2006-10-14 Thread Alexander Klingenstein
> Is there some reason you really want to convert to PDF first? You can
> get much better HTML right from the Word doc. You'll lose a lot of info
> going from PDF to HTML.

Right now, two reasons. First, printing to PDF allows me to create the PDF "for
the web", which means a much smaller file size for downloading. This is
mainly accomplished by automatic image resizing.

Second, HTML Tidy simply doesn't work on the files I get from Word's HTML export.
I need it to work automagically, without intervention, and be usable by people
who are not really PC literate:

D:\tmp\pdftohtml>tidy auberer.htm -o aub2.htm
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 172 column 70 - Error:  is not recognized!
line 172 column 70 - Warning: discarding unexpected 
line 172 column 75 - Warning: discarding unexpected 
line 176 column 47 - Error:  is not recognized!
line 176 column 47 - Warning: discarding unexpected 
line 176 column 52 - Warning: discarding unexpected 
line 179 column 77 - Error:  is not recognized!
line 179 column 77 - Warning: discarding unexpected 
line 179 column 82 - Warning: discarding unexpected 
line 182 column 55 - Error:  is not recognized!
line 182 column 55 - Warning: discarding unexpected 
line 182 column 60 - Warning: discarding unexpected 
line 185 column 57 - Error:  is not recognized!
line 185 column 57 - Warning: discarding unexpected 
line 185 column 62 - Warning: discarding unexpected 
line 188 column 55 - Error:  is not recognized!
94 warnings, 38 errors were found! Not all warnings/errors were shown.

This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
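
For an unattended batch, the same failure can be detected from Tidy's
exit status rather than by reading the messages. A minimal sketch,
assuming Python 3.7+ and a tidy executable on the PATH; the function
names are hypothetical. Tidy exits with 0 when the input is clean, 1 when
there were warnings, and 2 when there were errors, as in the run above.

```python
import subprocess

# HTML Tidy's exit status: 0 = clean, 1 = warnings, 2 = errors.
TIDY_STATUS = {0: "ok", 1: "warnings", 2: "errors"}


def classify_tidy_status(returncode):
    """Map a tidy exit status to a short label."""
    return TIDY_STATUS.get(returncode, "unknown")


def run_tidy(src, dst, exe="tidy"):
    """Run tidy on src, writing to dst; return (status, diagnostics)."""
    # -q suppresses the summary banner, -o names the output file.
    # Tidy writes its warnings and errors to stderr.
    proc = subprocess.run([exe, "-q", "-o", dst, src],
                          capture_output=True, text=True)
    return classify_tidy_status(proc.returncode), proc.stderr
```

A batch script could then skip or flag any file whose status comes back
as "errors" instead of stopping for manual cleanup.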

So far, pdftohtml has worked flawlessly and created much saner HTML output
out of the box than Word 2000.