Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:

> Marko Rauhamaa wrote:
>> '\udd00' is a valid str object:
>
> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in
> the first place. That's not a rhetorical question. It's a genuine
> question.
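
For concreteness, here is a minimal sketch of the behaviour being asked about, assuming CPython 3. The helper name try_encode is hypothetical and only there for illustration. It shows that the lone surrogate is accepted as a one-character str, that the strict UTF-8 codec refuses to encode it, and that the standard "surrogatepass" error handler lets it through:

```python
# A lone (unpaired) low surrogate, as in the quoted example.
LONE = '\udd00'


def try_encode(s, errors='strict'):
    """Return s encoded as UTF-8, or None if the codec rejects it.

    Hypothetical helper for illustration; not part of any library.
    """
    try:
        return s.encode('utf-8', errors)
    except UnicodeEncodeError:
        return None


if __name__ == '__main__':
    # Python happily creates the string and counts one code point.
    print(len(LONE))
    # Strict UTF-8 refuses lone surrogates, so this prints None.
    print(try_encode(LONE))
    # 'surrogatepass' encodes the surrogate like any other code point.
    print(try_encode(LONE, 'surrogatepass'))
    # The same handler is needed to decode those bytes again.
    raw = try_encode(LONE, 'surrogatepass')
    print(raw.decode('utf-8', 'surrogatepass') == LONE)
```

So the str object exists without complaint; the refusal only shows up at the encoding boundary, and only with the default strict error handling.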

The problem is that no matter how you shuffle surrogates, encoding
schemes, code points and the like, a wrinkle always remains.

I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ.
But that's where the buck stops; ℂ is closed under the traditional
arithmetic functions. Unicode apparently hasn't found a similar
closure.

That's why I think that while UTF-8 is a fabulous way to bring Unicode
to Linux, Linux should have taken the tack that Unicode is always an
application-level interpretation with few operating system tie-ins.

Unfortunately, the GNU world is busy trying to build a Unicode
frosting everywhere. The illusion can never be complete but is
convincing enough for application developers to forget to handle
corner cases.

To answer your question, I think every code point from 0 to 1114111
should be treated as valid and analogous. Thus Python is correct here:

   >>> len('\udd00')
   1
   >>> len('\ufeff')
   1

The alternatives are far too messy to consider.


Marko