On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:
This still seems odd to me. I would have thought that the unicode function would return a properly encoded byte stream that could then simply be written to disk. Instead it seems like you have to re-encode the byte stream to some kind of escaped ASCII before it can be written back out.
Here's what's really going on. Unicode strings within Python have to be indexable, so the internal representation uses a fixed number of bytes for each character (usually two, or four in a wide build), and strings work like arrays.

UTF-8 is a stream format for Unicode. It's a variable-length encoding: each character occupies 1 to 4 bytes, and the base ASCII characters (0..127 only, not 128..255) occupy one byte each. The format is described at http://en.wikipedia.org/wiki/UTF-8.

Because the characters vary in width, finding the Nth character in a UTF-8 file or stream means scanning from the beginning to keep track of where each Unicode character begins. So it's not a suitable format for data being actively worked on in memory; it can't be easily indexed. That's why Python keeps Unicode strings in an indexable form in memory, and why it's necessary to encode to UTF-8 before writing to a file or socket.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
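
For what it's worth, here is a minimal sketch of the encode-before-writing step described above. It assumes Python 3, where str is the Unicode type and bytes is what Python 2 called str; the sample string and the variable names are made up for illustration.

```python
# Assumption: Python 3, where str holds Unicode text and bytes holds encoded data.
text = "na\u00efve \u20ac5"  # illustrative sample: n, a, i-diaeresis, v, e, space, euro sign, 5

# A Unicode string is indexed by character.
third_char = text[2]  # the i-diaeresis, U+00EF

# Encoding turns the string into the byte stream that goes to a file or socket.
data = text.encode("utf-8")

# Each character takes 1 to 4 bytes in UTF-8; only ASCII (0..127) takes one byte.
widths = [len(ch.encode("utf-8")) for ch in text]

# Indexing the encoded bytes yields a byte value, not a character.
third_byte = data[2]  # lead byte of the two-byte sequence for U+00EF

# Decoding reverses the step when the data is read back in.
round_trip = data.decode("utf-8")

print(len(text), len(data), widths)
```

The string has fewer characters than the encoded data has bytes, which is why position N in the text does not correspond to position N in the UTF-8 stream.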