Alexander Klingenstein wrote:
> I need to take a bunch of .doc files (Word 2000) which have a little text,
> including some tables/layout, and mostly pictures, and convert them to PDF,
> and extract the text and images separately too. If I have a PDF, I can
> create the HTML with pdftohtml called from Python with popen. However, I
> need an automated way to convert the .doc to PDF first.

Is there some reason you really want to convert to PDF first? You can get much better HTML straight from the Word doc; you'll lose a lot of information going from PDF to HTML.

Something like this can open a doc in Word, save it as HTML, then close it:

    import os, win32com.client

    # EnsureDispatch builds the type library wrappers on first use, so that
    # win32com.client.constants is populated; plain Dispatch may leave it empty.
    wdApp = win32com.client.gencache.EnsureDispatch("Word.Application")
    wdApp.Visible = 1

    def SaveDocAsHTML(docPath, htmlPath):
        # Word resolves relative paths against its own working directory,
        # so pass absolute paths.
        doc = wdApp.Documents.Open(os.path.abspath(docPath))
        # See mk:@MSITStore:C:\Program%20Files\Microsoft%20Office\OFFICE11\1033\VBAWD10.CHM::/html/womthSaveAs1.htm
        # in Word VBA help doc for more info.

        # Saves all text and formatting with HTML tags so that the resulting
        # document can be viewed in a Web browser.
        doc.SaveAs(os.path.abspath(htmlPath), win32com.client.constants.wdFormatHTML)

        # Saves text with HTML tags with minimal cascading style sheet
        # formatting. The resulting document can be viewed in a Web browser.
        #doc.SaveAs(os.path.abspath(htmlPath), win32com.client.constants.wdFormatFilteredHTML)
        doc.Close()

And if you aren't satisfied with the ugly HTML you're likely to get, you can also try running µTidylib (http://utidylib.berlios.de/) on the output after this step.

Thank you,
Paul
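
P.S. Since you have a whole bunch of files, here is a rough sketch of how the batch run might be driven. The names plan_conversions and convert_all are made up for illustration, and the convert argument is assumed to be a function taking (docPath, htmlPath), such as SaveDocAsHTML above. It skips the "~$" owner files Word leaves next to open documents, and it collects failures instead of stopping at the first bad file. It uses os.makedirs(..., exist_ok=True), so it assumes Python 3.

```python
import os

def plan_conversions(src_dir, out_dir):
    """Return (doc_path, html_path) pairs for every .doc file in src_dir."""
    pairs = []
    for name in sorted(os.listdir(src_dir)):
        base, ext = os.path.splitext(name)
        # Skip non-.doc files and Word's "~$" owner (lock) files.
        if ext.lower() != ".doc" or name.startswith("~$"):
            continue
        doc_path = os.path.abspath(os.path.join(src_dir, name))
        html_path = os.path.abspath(os.path.join(out_dir, base + ".html"))
        pairs.append((doc_path, html_path))
    return pairs

def convert_all(src_dir, out_dir, convert):
    """Call convert(doc_path, html_path) for each pair; return the failures."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for doc_path, html_path in plan_conversions(src_dir, out_dir):
        try:
            convert(doc_path, html_path)
        except Exception as exc:
            # Remember the failure and carry on with the remaining files.
            failed.append((doc_path, exc))
    return failed
```

You would call it as convert_all(r"C:\docs", r"C:\docs\html", SaveDocAsHTML) (both paths are only placeholders) and then look at the returned list to see which files Word could not handle.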