On Thu, Jul 25, 2013 at 3:49 PM, Serhiy Storchaka <storch...@gmail.com> wrote:
> 24.07.13 21:15, Chris Angelico wrote:
>> To my mind, exposing UTF-16 surrogates to the application is a bug
>> to be fixed, not a feature to be maintained.
>
> Python 3 uses code points from U+DC80 to U+DCFF (which are in the
> surrogates area) to represent undecodable bytes with the
> surrogateescape error handler.
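A quick illustration of that behaviour, for anyone following along (the input bytes here are just an example I picked; any byte that isn't valid UTF-8 in context will do):

```python
# 0xFF is not valid UTF-8 here, so with the surrogateescape error
# handler it is smuggled through as the lone surrogate U+DCFF.
raw = b"abc\xff"
s = raw.decode("utf-8", "surrogateescape")
print(s[-1] == "\udcff")  # True

# And it round-trips back to the original bytes:
print(s.encode("utf-8", "surrogateescape") == raw)  # True
```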
That's a deliberate and conscious use of those codepoints; it's not what
I'm talking about here.

Suppose you read a UTF-8 stream of bytes from a file and decode it into
your language's standard string type. At that point, you should be
working with a string of Unicode codepoints:

"\22\341\210\264\360\222\215\205" --> "\x12\u1234\U00012345"

The incoming byte stream has a length of 8; the resulting character
stream has a length of 3. Now, if the language wants to use UTF-16
internally, it's free to do so:

0012 1234 d808 df45

When I referred to exposing surrogates to the application, this is what
I mean. If decoding the above byte stream results in a length-4 string
whose last two elements are \xd808 and \xdf45, the language is exposing
the surrogates; if it results in a length-3 string whose last element is
\U00012345, it's hiding them.

To be honest, I don't imagine I'll ever see a language that stores
strings in UTF-16 and then exposes them to the application as UTF-32;
there's very little point. But such a thing *is* possible, and for a
language working closely with libraries that demand UTF-16, it might
well make sense to do things that way.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list
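P.S. Python 3 itself makes a handy calculator for the example above: the same 8 bytes (written in hex rather than octal), the 3-codepoint string they decode to, and the UTF-16 code units including the D808/DF45 surrogate pair:

```python
import struct

# The 8 bytes from the example (octal \22\341\210\264\360\222\215\205).
data = b"\x12\xe1\x88\xb4\xf0\x92\x8d\x85"

s = data.decode("utf-8")
print(len(data))                    # 8 bytes in
print(len(s))                       # 3 codepoints out
print(s == "\x12\u1234\U00012345")  # True

# Encoded as UTF-16, U+12345 becomes the surrogate pair D808 DF45,
# giving four 16-bit code units in total.
units = struct.unpack("<4H", s.encode("utf-16-le"))
print([hex(u) for u in units])      # ['0x12', '0x1234', '0xd808', '0xdf45']
```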