Joe Goldthwaite wrote:
> import unicodedata
>
> input = file('ascii.csv', 'rb')
> output = file('unicode.csv','wb')
>
> for line in input.xreadlines():
>     unicodestring = unicode(line, 'latin1')
>     output.write(unicodestring.encode('utf-8')) # This second encode is what I was missing.

Actually, I see two problems here:

1. "ascii.csv" is not an ASCII file but a Latin-1 encoded file, so that is where the first confusion starts.
2. "unicode.csv" is not a "Unicode" file, because Unicode is not a file format. Rather, it is a UTF-8 encoded file, and UTF-8 is one encoding of Unicode. This is the second confusion.

> A number of you pointed out what I was doing wrong but I couldn't
> understand it until I realized that the write operation didn't work until
> it was using a properly encoded Unicode string.

The write function wants bytes! Encoding a string in your favourite encoding yields bytes.

> This still seems odd to me. I would have thought that the unicode
> function would return a properly encoded byte stream that could then
> simply be written to disk.

No, unicode() takes a byte stream and decodes it according to the given encoding. You then get an internal representation of the string, a unicode object. This representation typically resembles UCS2 or UCS4, which are more suitable for internal manipulation than UTF-8. This object is a string btw, so typical stuff like concatenation etc. is supported. However, the internal representation is a sequence of Unicode code points, not a guaranteed sequence of bytes, and bytes are what you want in a file.

> Instead it seems like you have to re-encode the byte stream to some
> kind of escaped Ascii before it can be written back out.

As mentioned above, you have a string. For writing, that string needs to be transformed to bytes again.

Note: You can also configure a file to read one encoding or write another. You then get unicode objects from the input which you can feed to the output. The important difference is that you specify each encoding in only one place, where the file is opened, and it will probably even be more performant. I'd have to search for the corresponding library calls, but a good starting point is http://docs.python.org.

Good luck!

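To make the two points above concrete, here is a minimal sketch. The first part shows the decode/encode round trip by hand. The second part uses io.open, which is my assumption about the kind of library call meant in the note above (codecs.open would be another candidate). The function name transcode and the file names in the usage line are made up for illustration.

```python
# -*- coding: utf-8 -*-
import io

# Part 1: the round trip by hand.
# Bytes on disk -> decode -> text (code points) -> encode -> bytes again.
latin1_bytes = b"caf\xe9"                # "cafe" with e-acute, Latin-1 encoded
text = latin1_bytes.decode("latin-1")    # same effect as unicode(line, 'latin1')
utf8_bytes = text.encode("utf-8")        # what write() on a binary file needs


# Part 2: let the file objects do the decoding and encoding.
def transcode(src_path, dst_path, src_encoding="latin-1", dst_encoding="utf-8"):
    """Copy a text file, decoding on read and encoding on write.

    Hypothetical helper: each encoding is named once, where the file is opened.
    """
    # newline="" passes line endings through untouched, which matters for CSV.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
            for line in src:      # each line is already decoded text
                dst.write(line)   # encoded to dst_encoding on the way out
```

Usage would be something like transcode("latin1.csv", "utf8.csv"). Inside the loop there are only text objects, so no explicit unicode() or encode() call is needed.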
Uli