"Richard Lewis" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > > On Thu, 16 Jun 2005 12:06:50 -0600, "John Roth" > <[EMAIL PROTECTED]> said: >> "Richard Lewis" <[EMAIL PROTECTED]> wrote in message >> news:[EMAIL PROTECTED] >> > Hi there, >> > >> > I'm having a problem with unicode files and ftplib (using Python >> > 2.3.5). >> > >> > I've got this code: >> > >> > xml_source = codecs.open("foo.xml", 'w+b', "utf8") >> > #xml_source = file("foo.xml", 'w+b') >> > >> > ftp.retrbinary("RETR foo.xml", xml_source.write) >> > #ftp.retrlines("RETR foo.xml", xml_source.write) >> > >> >> It looks like there are at least two problems here. The major one >> is that you seem to have a misconception about utf-8 encoding. >> > Who doesn't? ;-)
Lots of people. It's not difficult to understand; it just takes a bit of
attention to the messy details.

The basic concept is that Unicode text is _always_ processed as a
unicode string _in the program_. On disk or across the internet, it's
_always_ stored in an encoded form, frequently but not always utf-8. A
regular string _never_ stores raw unicode; it always holds some
encoding. When you read text data from the internet, it's _always_ in
some encoding. If that encoding is one of the utf encodings, it needs
to be decoded to unicode before it can be processed, but it does not
need to be changed at all to be written to disk.

>> Whatever program you are using to read it has to then decode
>> it from utf-8 into unicode. Failure to do this is what is causing
>> the extra characters on output.
>>
>
>> Amusingly, this would have worked:
>>
>> xml_source = codecs.EncodedFile("foo.xml", "utf-8", "utf-8")
>>
>> It is, of course, an expensive way of doing nothing, but
>> it at least has the virtue of being good documentation.
>>
> OK, I've fiddled around a bit more but I still haven't managed to get
> it to work. I get the fact that it's not the FTP operation that's
> causing the problem, so it must be either the xml.minidom.parse()
> function (and whatever sort of file I give it) or the way that I
> write my results to output files after I've done my DOM processing.
> I'll post some more detailed code:

Please post _all_ of the relevant code. It wastes people's time when
you post incomplete examples; the critical issue is frequently in the
part that you didn't post.
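The decode-process-encode round trip described above can be sketched in a few lines (Python 3 shown for clarity, where the bytes/text split is explicit; in the Python 2.3 of this thread the same roles are played by `str` and `unicode`):

```python
# utf-8 encoded bytes, as they would arrive from disk or a socket:
raw = b'caf\xc3\xa9'

# Decode to a unicode string for processing inside the program:
text = raw.decode('utf-8')
assert text == 'café'
assert len(text) == 4      # four characters...
assert len(raw) == 5       # ...but five bytes: 'é' takes two bytes in utf-8

# To store or transmit it, encode again; utf-8 -> utf-8 is lossless,
# which is why already-encoded data needs no change before writing:
assert text.encode('utf-8') == raw
```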
>
> def open_file(file_name):
>     ftp = ftplib.FTP(self.host)
>     ftp.login(self.login, self.passwd)
>
>     content_file = file(file_name, 'w+b')
>     ftp.retrbinary("RETR " + self.path, content_file.write)
>     ftp.quit()
>     content_file.close()
>
>     ## Case 1:
>     #self.document = parse(file_name)
>
>     ## Case 2:
>     #self.document = parse(codecs.open(file_name, 'r+b', "utf-8"))
>
>     # Case 3:
>     content_file = codecs.open(file_name, 'r', "utf-8")
>     self.document = parse(codecs.EncodedFile(content_file, "utf-8",
>                                              "utf-8"))
>     content_file.close()
>
> In Case 1 I get the incorrectly encoded characters.
>
> In Case 2 I get the exception:
> "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
> position 5208: ordinal not in range(128)"
> when it calls the xml.minidom.parse() function.
>
> In Case 3 I get the exception:
> "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
> position 5208: ordinal not in range(128)"
> when it calls the xml.minidom.parse() function.

That's exactly what you should expect. In the first case, the file on
disk is encoded as utf-8, and this is apparently what minidom is
expecting. The documentation shows a simple read; it does not show any
kind of encoding or decoding.

> Anyway, later on in the program I create a *very* large unicode
> string after doing some playing with the DOM tree. I then write this
> to a file using:
>
> html_file = codecs.open(file_name, "w+b", "utf8")
> html_file.write(very_large_unicode_string)
>
> The problem could be here?

That should work. The problem, as I said in the first post, is that
whatever program you are using to render the file to screen or print
is _not_ treating the file as utf-8 encoded. It either needs to be
told that the file is in utf-8 encoding, or you need to get a better
rendering program. Many renderers, including most renderers inside of
programming tools like file inspectors and debuggers, assume that the
encoding is latin-1 or windows-1252.
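The point about minidom can be shown directly: hand the parser raw encoded bytes and let it honor the encoding declared in the XML prolog, rather than pre-decoding with codecs.open (a Python 3 sketch; `parseString` stands in for `parse` on a file so the example is self-contained):

```python
from xml.dom.minidom import parseString

# Raw utf-8 bytes, as Case 1 would read them from disk. The document
# contains 'æ' (u'\xe6'), the character from the poster's traceback.
doc_bytes = '<?xml version="1.0" encoding="utf-8"?><r>æ</r>'.encode('utf-8')

# The parser reads the declared encoding and decodes for itself; no
# codecs wrapper is needed or wanted:
doc = parseString(doc_bytes)
text = doc.documentElement.firstChild.data

assert text == '\xe6'   # the parsed tree holds decoded unicode text
```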
This will throw up funny characters if you try to read a utf-8 (or any
multi-byte encoded) file using them.

One trick that sometimes works is to ensure that the first character
is the BOM (byte order mark, or unicode signature). Properly written
Windows programs will use this as an encoding signature. Unixoid
programs frequently won't, but that's arguably a violation of the
Unicode standard. The BOM is a single unicode character which occupies
three bytes in utf-8 encoding.

John Roth

>
> Cheers,
> Richard

--
http://mail.python.org/mailman/listinfo/python-list
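The BOM arithmetic above checks out in a couple of lines (Python 3 shown; the `utf-8-sig` codec used here is the stdlib's way of writing and stripping the signature, not something mentioned in the thread):

```python
import codecs

# U+FEFF is one unicode character, but three *bytes* in utf-8:
bom_char = '\ufeff'
assert bom_char.encode('utf-8') == codecs.BOM_UTF8
assert len(codecs.BOM_UTF8) == 3                 # b'\xef\xbb\xbf'

# 'utf-8-sig' prepends the BOM on encode and strips it on decode,
# matching how Windows programs use it as an encoding signature:
data = 'hello'.encode('utf-8-sig')
assert data.startswith(codecs.BOM_UTF8)
assert data.decode('utf-8-sig') == 'hello'
```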