Re: Ascii to Unicode.

John Nagle Thu, 29 Jul 2010 12:28:01 -0700

On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:

> This still seems odd to me.  I would have thought that the unicode function
> would return a properly encoded byte stream that could then simply be
> written to disk. Instead it seems like you have to re-encode the byte stream
> to some kind of escaped Ascii before it can be written back out.


   Here's what's really going on.

   Unicode strings within Python have to be indexable.  So the internal
representation uses a fixed number of bytes for each character (usually
two; four on "wide" builds), and the strings work like arrays.
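
   As a small illustration of the indexing point, here is a sketch that
assumes Python 3, where the str type plays the role of the unicode type
discussed in this thread.  The sample text and variable names are made
up for the example.

```python
# Sample text mixing ASCII and non-ASCII characters (illustrative only):
# "na", U+00EF (i with diaeresis), "ve ", U+20AC (euro sign), "5".
text = "na\u00efve \u20ac5"

# Indexing and slicing work per character, regardless of how many
# bytes each character would need in an encoded form such as UTF-8.
third = text[2]      # the single character U+00EF
price = text[6:]     # the euro sign followed by "5"
length = len(text)   # counts characters, not bytes
```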

   UTF-8 is a stream format for Unicode.  It's a variable-length
encoding: each character occupies 1 to 4 bytes, and the base ASCII
characters (0..127 only, not 128..255) occupy one byte each.  The format
is described in "http://en.wikipedia.org/wiki/UTF-8".  A UTF-8 file or
stream has to be parsed from the beginning to keep track of where each
Unicode character begins.  So it's not a suitable format for
data being actively worked on in memory; it can't be easily indexed.
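
   To make the variable width concrete, the following sketch (again
assuming Python 3; the four sample characters are arbitrary picks, one
per UTF-8 length class) encodes a few characters and shows that
indexing the encoded bytes does not land on character boundaries.

```python
# One character from each UTF-8 length class (illustrative picks):
# "A", U+00E9 (e acute), U+20AC (euro sign), U+1F600 (an emoji).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in samples]

text = "".join(samples)
encoded = text.encode("utf-8")

# Indexing the encoded form yields a byte, not a character: position 1
# is only the first byte of the two-byte sequence for U+00E9.
second_byte = encoded[1]
second_char = text[1]
```

   Here widths comes out as 1, 2, 3 and 4 bytes, and encoded[1] is the
lead byte 0xC3 rather than a whole character.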

   That's why it's necessary to convert to UTF-8 before writing
to a file or socket.
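
   A minimal sketch of that conversion step, assuming Python 3 and a
throwaway temporary file (the file name and sample text are
hypothetical): encode to bytes on the way out, decode on the way back
in.

```python
import os
import tempfile

# In-memory Unicode string: "caf", U+00E9, space, U+20AC, "5".
text = "caf\u00e9 \u20ac5"

# Convert to a UTF-8 byte stream, the form suitable for a file or socket.
data = text.encode("utf-8")

# Write the raw bytes to a temporary file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "wb") as f:
    f.write(data)

# Reading the file gives bytes again; decoding restores the string.
with open(path, "rb") as f:
    round_trip = f.read().decode("utf-8")
```

   Opening the file in text mode with an explicit encoding argument
would do the same conversion implicitly; the binary-mode version is
shown only to keep the encode and decode steps visible.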

                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
