On 18/08/2012 19:26, Paul Rubin wrote:
Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two code points. This is fragile and doesn't work very well,
because string-handling methods can break the surrogate pairs apart,
leaving you with an invalid Unicode string. Not good.)
...
With PEP 393, each Python string will be stored in the most efficient
format possible:
Can you explain the issue of "breaking surrogate pairs apart" a little
more? Switching between encodings based on the string contents seems
silly at first glance. Strings are immutable, so I don't understand why
UTF-8 or UTF-16 isn't used for everything. UTF-8 is more efficient for
Latin-based alphabets and UTF-16 may be more efficient for some other
languages. I think even UCS-4 doesn't completely fix the surrogate pair
issue, if that issue is the only thing I can think of.
On a narrow build, code points outside the BMP are stored as a surrogate
pair (2 code points of 16 bits each). On a wide build, all code points
can be represented without the need for surrogate pairs.
The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of a surrogate pair, which
leaves a lone surrogate at the cut and so an invalid string.
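
As a rough illustration of that, here is a small sketch. Python 3.3 and later no longer have narrow builds, so it imitates one by working on the 16-bit UTF-16 code units directly. The helper names utf16_units and decodes_cleanly are made up for this example and are not part of any library.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units that UTF-16 uses for text."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def decodes_cleanly(units):
    """Report whether a run of 16-bit code units is valid UTF-16."""
    data = struct.pack("<%dH" % len(units), *units)
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True

text = "a\U0001F600b"        # U+1F600 lies outside the BMP
units = utf16_units(text)    # roughly what a narrow build would index

print([hex(u) for u in units])     # 'a', high surrogate, low surrogate, 'b'
print(decodes_cleanly(units[:3]))  # slice keeps the pair together
print(decodes_cleanly(units[:2]))  # slice cuts between the two surrogates
```

The slice units[:2] ends on the high surrogate with no low surrogate after it, so it cannot be decoded, whereas units[:3] keeps the pair intact. On a wide build (or under PEP 393) the same text has length 3 and no slice can land inside a character.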