random...@fastmail.us wrote:

> My point is there are very few
> problems to which "count of Unicode code points" is the only right
> answer - that UTF-32 is good enough for but that are meaningfully
> impacted by a naive usage of UTF-16, to the point where UTF-16 is
> something you have to be "safe" from.

I'm not sure why you care about the "count of Unicode code points",
although that *is* a problem. Not for end-user reasons like "how long is
my password?", but because it makes your job as a programmer harder.

[steve@ando ~]$ python2.7 -c "print (len(u'\U00004444:\U00014445'))"
4
[steve@ando ~]$ python3.3 -c "print (len(u'\U00004444:\U00014445'))"
3

It's hard to reason about your code when something as fundamental as the
length of a string is implementation-dependent. (By the way, the right
answer should be 3, not 4. The 2.7 here is a narrow build, which counts
U+14445 as two surrogate code units.)

But an even more important problem is that broken-UTF-16 lets you create
invalid, impossible Unicode strings *by accident*. Naturally you can
create broken Unicode if you assemble strings of surrogates yourself, but
broken-UTF-16 means it can happen from otherwise innocuous operations
like reversing a string:

py> s = u'\U00004444:\U00014445'  # Python 2.7 narrow build
py> s[::-1]
u'\udc45\ud811:\u4444'

It's hard for me to demonstrate that the reversed string is broken
because the shell I am using does an amazingly good job of handling
broken Unicode. Even if I print it, the shell just prints
missing-character glyphs instead of crashing (fortunately for me!). But
the first two code units are in illegal order:

  \udc45 is a low surrogate, and must follow a high surrogate;
  \ud811 is a high surrogate, and must precede a low surrogate.

I'm not convinced you should be allowed to create Unicode strings
containing mismatched surrogates like this deliberately, but you
certainly shouldn't be able to do so by accident.

--
Steven
--
https://mail.python.org/mailman/listinfo/python-list
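
P.S. As a rough illustration of the surrogate arithmetic above, here is a small Python 3 sketch. The helper names `to_utf16_units` and `is_well_formed` are made up for this example and are not part of any standard API. It splits U+14445 into its surrogate pair by hand and then checks whether a sequence of UTF-16 code units has its surrogates correctly paired, which is exactly what the naive reversal breaks.

```python
def to_utf16_units(s):
    """Encode a str into a list of UTF-16 code units (plain ints)."""
    units = []
    for ch in s:
        cp = ord(ch)
        if cp >= 0x10000:
            # Code points outside the BMP become a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low (trail) surrogate
        else:
            units.append(cp)
    return units


def is_well_formed(units):
    """Return True if every surrogate belongs to a high+low pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no high surrogate in front of it.
            return False
        else:
            i += 1
    return True


units = to_utf16_units('\U00004444:\U00014445')
print([hex(u) for u in units])      # ['0x4444', '0x3a', '0xd811', '0xdc45']
print(is_well_formed(units))        # True
print(is_well_formed(units[::-1]))  # False: reversal puts 0xdc45 first
```

Reversing the list of code units mimics what `s[::-1]` does on a narrow build: it works on the four 16-bit units rather than on the three code points, so the pair comes out back to front.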