On Wed, 28 Jul 2010 15:58:01 -0700, Joe Goldthwaite wrote:

> This still seems odd to me. I would have thought that the unicode
> function would return a properly encoded byte stream that could then
> simply be written to disk. Instead it seems like you have to re-encode
> the byte stream to some kind of escaped Ascii before it can be written
> back out.
I'm afraid that's not even wrong. The unicode function returns a unicode string object, not a byte-stream, just as the list function returns a sequence of objects, not a byte-stream.

Perhaps this will help:

http://www.joelonsoftware.com/articles/Unicode.html

Summary:

ASCII is not a synonym for bytes, no matter what some English-speakers think. ASCII is an encoding from bytes like \x41 to characters like "A".

Unicode strings are a sequence of code points. A code point is a number, implemented in some complex fashion that you don't need to care about. Each code point maps conceptually to a letter; for example, the English letter A is represented by the code point U+0041 and the Arabic letter Ain is represented by the code point U+0639.

You shouldn't make any assumptions about the size of each code point, or how they are put together. You shouldn't expect to write code points to a disk and have the result make sense, any more than you could expect to write a sequence of tuples or sets or dicts to disk in any sensible fashion. You have to serialise them to bytes first, and that's what the encode method does. Decode does the opposite, taking bytes and creating unicode strings from them.

For historical reasons -- backwards compatibility with files already created, back in the Bad Old Days before Unicode -- there are a whole slew of different encodings available. There is no 1:1 mapping between bytes and strings. If all you have are the bytes, there is literally no way of knowing what string they represent (although sometimes you can guess). You need to know which encoding was used, or take a guess, or make repeated decodings until something doesn't fail and hope that's the right one.

As a general rule, Python will try encoding/decoding using the ASCII encoding unless you tell it differently.

Any time you are writing to disk, you need to serialise the objects, regardless of whether they are floats, or dicts, or unicode strings.
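To make the encode/decode round trip concrete, here is a minimal sketch. It assumes Python 3, where str plays the role of the unicode type discussed above and bytes is its own type. The sample string, the choice of UTF-8 and the temporary file name are purely illustrative; none of them come from the original question.

```python
# Python 3 sketch: str here plays the role of Python 2's unicode type.
import os
import tempfile

# Two code points: U+0041 (Latin A) followed by U+0639 (Arabic Ain).
text = "A\u0639"

# encode: code points -> bytes. The bytes depend on the encoding chosen.
utf8_bytes = text.encode("utf-8")       # 3 bytes: b'A\xd8\xb9'
utf16_bytes = text.encode("utf-16-le")  # 4 bytes, a different serialisation

# decode: bytes -> code points. It round-trips only with the right encoding.
assert utf8_bytes.decode("utf-8") == text

# The same bytes decoded with the wrong encoding yield a different string:
# the bytes alone do not say what text they represent.
wrong = utf8_bytes.decode("latin-1")

# ASCII has no representation for Ain, so encoding to ASCII fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Disk holds bytes, so serialise explicitly before writing.
# The file name is a hypothetical one in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "wb") as f:
    f.write(utf8_bytes)
with open(path, "rb") as f:
    round_tripped = f.read().decode("utf-8")
```

Whoever reads the file back has to be told, or has to guess, that it was written as UTF-8; nothing in the three bytes on disk records that.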
--
Steven