Kay Schluehr wrote:
On 28 Oct., 15:25, [EMAIL PROTECTED] wrote:
All,

I am trying to write a script that will parse and extract data from a
MS Word document.  Can / would anyone refer me to a tutorial on how to
do that?  (perhaps from tables).  I am aware of, and have downloaded
the pywin32 extensions, but am unsure of how to proceed -- I'm not
familiar with the COM API for Word, so help with that would also be
welcome.

Any help would be appreciated.  Thanks for your attention and
patience.

::bp::

One can save MS Word documents as MHTML ("Single File Web Page"), a
MIME archive that wraps the HTML that Word generates. If I remember
correctly, those documents have an .mht extension. The result is a
huge amount of (nevertheless structured) markup gibberish together
with the text. If one spends time and attention, one can find
patterns in the markup (it is plain-text markup, so it is human-readable).
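
To make the pattern-hunting concrete, here is a minimal sketch that uses only the Python standard library. It assumes the document was saved from Word as an .mht file, that the file is an ordinary MIME multipart message, and that the tables of interest are plain, non-nested HTML tables. The names TableCellCollector and table_rows_from_mht are made up for this example.

```python
import email
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so Word's line wrapping disappears.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows_from_mht(raw_bytes):
    """Return the table rows found in the first text/html part of an .mht file."""
    message = email.message_from_bytes(raw_bytes)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes quoted-printable or base64 transfer encoding.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            collector = TableCellCollector()
            collector.feed(payload.decode(charset, errors="replace"))
            collector.close()
            return collector.rows
    return []
```

Reading the file with open(path, "rb").read() and passing the bytes to table_rows_from_mht should give a list of rows, each a list of cell strings. Nested tables and merged cells are not handled, so treat the output as a starting point for inspection.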

A related solution is to use OpenOffice to convert the document to
OpenDocument Format, a zip archive containing several XML files, and
then use odfpy to parse the XML and access the contents as linked
objects.
http://opendocumentfellowship.com/development/projects/odfpy
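
Because an .odt file is just a zip archive, the table contents can also be reached without odfpy. The sketch below is a standard-library stand-in (zipfile plus ElementTree), not odfpy's own API. It assumes the Word file has already been converted to .odt and that the tables are not nested. The function name tables_from_odt is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the OpenDocument specification.
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def tables_from_odt(path_or_file):
    """Return {table name: rows}, where each row is a list of cell strings."""
    # The document body lives in content.xml inside the zip archive.
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    tables = {}
    for table in root.iter("{%s}table" % NS["table"]):
        name = table.get("{%s}name" % NS["table"], "")
        rows = []
        # iter() also finds rows wrapped in table:table-header-rows.
        for row in table.iter("{%s}table-row" % NS["table"]):
            cells = []
            for cell in row.findall("table:table-cell", NS):
                paragraphs = cell.findall("text:p", NS)
                # itertext() joins text split across text:span children.
                cells.append("\n".join("".join(p.itertext()) for p in paragraphs))
            rows.append(cells)
        tables[name] = rows
    return tables
```

Calling tables_from_odt("report.odt") (a hypothetical file name) should return a dictionary keyed by table name. Repeated-cell attributes and nested tables are ignored here; odfpy is the better fit once the structure needs to be navigated as objects.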
