On Thu, 16 Jun 2005 12:06:50 -0600, "John Roth" <[EMAIL PROTECTED]> said:
> "Richard Lewis" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
> > Hi there,
> >
> > I'm having a problem with unicode files and ftplib (using Python 2.3.5).
> >
> > I've got this code:
> >
> > xml_source = codecs.open("foo.xml", 'w+b', "utf8")
> > #xml_source = file("foo.xml", 'w+b')
> >
> > ftp.retrbinary("RETR foo.xml", xml_source.write)
> > #ftp.retrlines("RETR foo.xml", xml_source.write)
> >
>
> It looks like there are at least two problems here. The major one
> is that you seem to have a misconception about utf-8 encoding.
>
Who doesn't? ;-)
>
> Whatever program you are using to read it has to then decode
> it from utf-8 into unicode. Failure to do this is what is causing
> the extra characters on output.
>
>
> Amusingly, this would have worked:
>
> xml_source = codecs.EncodedFile("foo.xml", "utf-8", "utf-8")
>
> It is, of course, an expensive way of doing nothing, but
> it at least has the virtue of being good documentation.
>

OK, I've fiddled around a bit more but I still haven't managed to get it
to work. I get that it's not the FTP operation that's causing the
problem, so it must be either the xml.dom.minidom.parse() function (and
whatever sort of file I give it) or the way that I write my results to
output files after I've done my DOM processing.

I'll post some more detailed code:

import codecs
import ftplib
from xml.dom.minidom import parse

def open_file(self, file_name):
    # Fetch the remote file as raw bytes and save it locally.
    ftp = ftplib.FTP(self.host)
    ftp.login(self.login, self.passwd)
    content_file = file(file_name, 'w+b')
    ftp.retrbinary("RETR " + self.path, content_file.write)
    ftp.quit()
    content_file.close()

    ## Case 1:
    #self.document = parse(file_name)

    ## Case 2:
    #self.document = parse(codecs.open(file_name, 'r+b', "utf-8"))

    # Case 3:
    content_file = codecs.open(file_name, 'r', "utf-8")
    self.document = parse(codecs.EncodedFile(content_file, "utf-8", "utf-8"))
    content_file.close()

In Case 1 I get the incorrectly encoded characters.

In Case 2 I get the exception:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
position 5208: ordinal not in range(128)"
when it calls the xml.dom.minidom.parse() function.

In Case 3 I get the exception:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
position 5208: ordinal not in range(128)"
when it calls the xml.dom.minidom.parse() function.

The character at position 5208 is an 'a' (assuming Emacs' goto-char
function has the same idea about file positions as
xml.dom.minidom.parse()?). When I first tried these two new cases it
came up with an unencodable character at another position.
By replacing the large dash at this position with an ordinary minus sign
I stopped it from raising the exception at that point in the file. I
checked the character \xe6 and (assuming I know what I'm doing) it's a
small ae ligature.

Anyway, later on in the program I create a *very* large unicode string
after doing some playing with the DOM tree. I then write this to a file
using:

html_file = codecs.open(file_name, "w+b", "utf8")
html_file.write(very_large_unicode_string)

Could the problem be here?

Cheers,
Richard
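
A minimal sketch of the round trip discussed in this thread, assuming the
parser is handed the undecoded bytes and that text is encoded only once,
on output. The sample document, the function name roundtrip() and the
output path are hypothetical, and the snippet is written to run on a
current Python rather than 2.3.5. It only illustrates where decoding and
encoding happen; it is not a diagnosis of the errors above.

```python
import codecs
import os
import tempfile
from xml.dom.minidom import parseString

# Hypothetical sample: UTF-8 bytes containing an ae ligature (U+00E6)
# and an en dash (U+2013), similar to the characters mentioned above.
RAW_XML = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u'<doc>\u00e6 \u2013 test</doc>'
).encode("utf-8")


def roundtrip(raw_bytes, out_path):
    """Parse raw UTF-8 bytes, then write the element text back as UTF-8."""
    # Give the parser bytes; it takes the encoding from the XML declaration.
    document = parseString(raw_bytes)
    text = document.documentElement.firstChild.data  # a unicode string

    # Encode exactly once, when writing the result.
    out = codecs.open(out_path, "w", "utf-8")
    try:
        out.write(text)
    finally:
        out.close()
    return text


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "out.html")
    roundtrip(RAW_XML, path)
```

The point of the sketch is that the data is decoded in one place (inside
the parser) and encoded in one place (the codecs writer), so no step has
to fall back on the default 'ascii' codec.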