That KB document was really helpful, but the problem still isn't solved. What's weird now is that the Unicode characters come out as things like รจ after some odd conversion. However, I noticed that when I try to open the Word documents after running the first for statement, Word shows a File Conversion window asking how I want to encode the file. None of the Unicode options retain the characters. Looking further, I found Central European options (both ISO and Windows) that work perfectly, since the documents I am looking at are in Czech. But when I then try to save the document in Word, it warns that saving as a text file will lose the formatting! So I guess I'm back at the start.
Judging from some internet searches, I'm not the only one having this problem. For some reason Word can only save as .doc even though .txt can support the utf8 format with all these characters. Any ideas?

On Oct 22, 5:39 am, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
> En Sun, 21 Oct 2007 15:32:57 -0300, <[EMAIL PROTECTED]> escribió:
>
> > However, I still cannot read the unicode from the Word file. If take
> > out the first for-statement, I get a bunch of garbled text, which
> > isn't helpful. I would save them all manually, but I want to figure
> > out how to do it in Python, since I'm just beginning.
>
> > My intuition says the problem is with
>
> > FileFormat=win32com.client.constants.wdFormatText
>
> > because it converts fine to a text file, just not a utf-8 text file.
> > How can I modify this or is there another way to code this type of
> > file conversion from *.doc to *.txt with unicode characters?
>
> Ah! I thought you were getting the right file format.
> I can't test it now, but this KB document
> http://support.microsoft.com/kb/209186/en-us
> suggests you should use wdFormatUnicodeText when saving the document.
> What the MS docs call "unicode" when dealing with files, is in general
> utf16.
> In this case, if you want to convert to utf8, the sequence would be:
>
> f = open(original_filename, "rb")
> udata = f.read().decode("utf16")
> f.close()
> f = open(new_filename, "wb")
> f.write(udata.encode("utf8"))
> f.close()
>
> --
> Gabriel Genellina
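
For what it's worth, here is a minimal sketch that combines the two observations above: the utf16-to-utf8 sequence from the quoted reply, and the fact that the Central European (Windows) option displays the Czech text correctly. It is written for Python 3 and is untested against real Word output. The function name to_utf8, the fallback parameter, and the choice of "cp1250" as the Windows Central European codec are my assumptions, not anything Word or win32com provides.

```python
import codecs

def to_utf8(src_path, dst_path, fallback="cp1250"):
    """Re-encode a text file saved by Word as UTF-8 (hypothetical helper)."""
    with open(src_path, "rb") as f:
        raw = f.read()

    # A UTF-16 text file normally starts with a byte order mark (BOM).
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")   # the codec consumes the BOM
    else:
        # Assumption: anything else is in the legacy code page, e.g.
        # cp1250 (Windows Central European) or iso8859_2 for Czech.
        text = raw.decode(fallback)

    with open(dst_path, "wb") as f:
        f.write(text.encode("utf-8"))
    return text
```

Calling to_utf8("in.txt", "out.txt") (placeholder file names) would then produce a UTF-8 copy whichever of the two formats Word wrote. The BOM check is only a heuristic: a UTF-16 file saved without a BOM would be decoded with the fallback codec and come out garbled.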
-- http://mail.python.org/mailman/listinfo/python-list