(pardon the resend, but I accidentally omitted a couple of words)

On 08/19/2012 08:14 AM, wxjmfa...@gmail.com wrote:
> On Sunday, 19 August 2012 12:26:44 UTC+2, Chris Angelico wrote:
>> <SNIP>
>>
>> No, it uses Unicode, and as an optimization, attempts to store the
>> codepoints in less than four bytes for most strings. The fact that a
>> one-byte storage format happens to look like latin-1 is rather
>> coincidental.
>>
> And this is the common basic mistake. You do not push your
> argumentation far enough. A character may "fall" accidentally in a latin-1.
> The problem lies in these european characters, which can not fall in this
> coding. This *is* the cause of the negative side effects.
> If you are using a correct coding scheme, like cp1252, mac-roman or
> iso-8859-15, you will never see such a negative side effect.
> Again, the problem is not the result, the encoded character. The critical
> part is the character which may cause this side effect.
> You should think "character set" and not encoded "code point", considering
> this kind of expression has a sense in 8-bits coding scheme.
>
> jmf

But that choice was made decades ago, when Unicode picked its second 128
characters. The internal form used in this PEP is simply the low-order
byte of the Unicode code point. Scanning the string to decide whether
converting to cp1252 (for example) would work would be a much more
expensive operation than seeing how many bytes it'd take for the largest
code point.

The 8 bit form is used if all the code points are less than 256. That
is a simple description, and simple code. As several people have said,
the fact that this byte matches one of the DECODED forms is coincidence.

--
DaveA
--
http://mail.python.org/mailman/listinfo/python-list
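
P.S. For anyone who wants to see the rule above in code, here is a rough
sketch in pure Python. It is not the CPython implementation; the names
storage_width and fits_cp1252 are made up for illustration, and the real
interpreter does this in C with extra cases (such as a separate flag for
pure-ASCII strings) that are left out here.

```python
def storage_width(s: str) -> int:
    """Bytes per code point that the rule described above would pick.

    Hypothetical helper: it only looks at the largest code point.
    """
    largest = max(map(ord, s), default=0)
    if largest < 0x100:      # every code point fits in one byte
        return 1
    if largest < 0x10000:    # every code point fits in two bytes
        return 2
    return 4                 # anything beyond the BMP needs four bytes


def fits_cp1252(s: str) -> bool:
    """The alternative test: would the whole string encode as cp1252?

    Hypothetical helper: it has to run a real codec over the string,
    which is more work than comparing code points against a limit.
    """
    try:
        s.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for text in ["cafe", "caf\u00e9", "\u20ac", "\U0001F600"]:
        print(ascii(text), storage_width(text), fits_cp1252(text))
```

The euro sign (U+20AC) is a handy test case: cp1252 can encode it, but
its code point is above 255, so the width rule puts it in the two-byte
form. The one-byte form lines up with latin-1 only because the first
256 Unicode code points were assigned that way.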