On Thu, Jul 25, 2013 at 8:09 AM, Terry Reedy <tjre...@udel.edu> wrote:
> On 7/24/2013 2:15 PM, Chris Angelico wrote:
>> To my mind, exposing UTF-16 surrogates to the application is a bug
>> to be fixed, not a feature to be maintained.
>
> It is definitely not a feature, but a proper UTF-16 implementation would not
> expose them except to codecs, just as with the PEP 393 implementation. (In
> both cases, I am excluding the sys size function as 'exposing to the
> application'.)
>
>> But since we can get the best of both worlds with only
>> a small amount of overhead, I really don't see why anyone should be
>> objecting.
>
> I presume you are referring to the PEP 393 1-2-4 byte implementation. Given
> how well it has been optimized, I think it was the right choice for Python.
> But a language that now uses UCS-2 or defective UTF-16 on all platforms might
> find the auxiliary array an easier fix.
>

I'm referring here to objections like jmf's, and also to threads like this:

http://mozilla.6506.n7.nabble.com/Flexible-String-Representation-full-Unicode-for-ES6-td267585.html

According to the ECMAScript people, UTF-16 and exposing surrogates to
the application is a critical feature to be maintained. I disagree.
But it's not my language, so I'm stuck with it. (I ended up writing a
little wrapper function in C that detects unpaired surrogates, but
that still doesn't deal with the possibility that character indexing
can create a new character that was never there to start with.)

ChrisA
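
To illustrate that last point, here is a minimal Python sketch. It is not the C wrapper mentioned above; the function names (to_utf16_units, find_unpaired_surrogate, from_utf16_units) are made up for this example, and it models a UTF-16 string as a plain list of 16-bit code units. It shows that a check for unpaired surrogates can pass on a string built by indexing code units, even though the resulting character was never in the original text.

```python
import struct


def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def from_utf16_units(units):
    """Decode a list of well-formed UTF-16 code units back to a str."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le")


def find_unpaired_surrogate(units):
    """Return the index of the first unpaired surrogate, or -1 if none."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return i
        i += 1
    return -1


# Two astral characters, U+10000 and U+1F600: four code units in UTF-16.
units = to_utf16_units("\U00010000\U0001F600")

# Cutting a pair in half is caught by the check.
truncated = units[:3]

# Taking the first and last code units yields a *valid* pair, so the
# check passes, but the pair decodes to a character that was never there.
spliced = [units[0], units[3]]
```

Here find_unpaired_surrogate(truncated) reports the dangling high surrogate at index 2, while find_unpaired_surrogate(spliced) returns -1 and from_utf16_units(spliced) gives U+10200, which appears nowhere in the original string.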