On Fri, 17 Aug 2012 23:30:22 -0400, Dave Angel wrote:

> On 08/17/2012 08:21 PM, Ian Kelly wrote:
>> On Aug 17, 2012 2:58 PM, "Dave Angel" <d...@davea.name> wrote:
>>> The internal coding described in PEP 393 has nothing to do with
>>> latin-1 encoding.
>>
>> It certainly does. PEP 393 provides for Unicode strings to be
>> represented internally as any of Latin-1, UCS-2, or UCS-4, whichever
>> is smallest and sufficient to contain the data.
Unicode strings are not represented as Latin-1 internally. Latin-1 is a
byte encoding, not a Unicode internal format. Perhaps you mean to say
that they are represented as a single-byte format?

>> I understand the complaint to be that while the change is great for
>> strings that happen to fit in Latin-1, it is less efficient than
>> previous versions for strings that do not.
>
> That's not the way I interpreted the PEP 393. It takes a pure unicode
> string, finds the largest code point in that string, and chooses 1, 2
> or 4 bytes for every character, based on how many bits it'd take for
> that largest code point.

That's how I interpret it too.

> Further I read it to mean that only 00 bytes would be dropped in the
> process, no other bytes would be changed.

Just to clarify, you aren't talking about the \0 character, but only
about extraneous "padding" 00 bytes.

> I also figure this is going to be more space efficient than Python 3.2
> for any string which had a max code point of 65535 or less (in
> Windows), or 4 billion or less (in real systems). So unless French has
> code points over 64k, I can't figure that anything is lost.

I think that on narrow builds it won't make much of a difference. The
big savings are for wide builds (there is a quick sys.getsizeof sketch
below that shows the width selection directly).

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
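
For anyone who wants to see the 1/2/4-byte selection for themselves,
here is a minimal sketch. It assumes CPython 3.3 or later (where PEP 393
is implemented); the exact byte counts reported will vary by platform
and version, but the relative growth is the point:

    import sys

    # Equal-length strings whose largest code point forces different
    # internal widths under PEP 393.
    one_byte  = "a" * 1000           # max code point < 256   -> 1 byte/char
    two_byte  = "\u0394" * 1000      # max code point < 65536 -> 2 bytes/char
    four_byte = "\U0001F600" * 1000  # max code point > 65535 -> 4 bytes/char

    for label, s in [("1-byte", one_byte),
                     ("2-byte", two_byte),
                     ("4-byte", four_byte)]:
        print(label, len(s), sys.getsizeof(s))

On a 3.3 build the three sizes differ by roughly factors of two and four
for the same string length, which is the space saving being discussed;
on 3.2, a narrow build pays 2 bytes per character and a wide build 4,
regardless of the string's contents.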