On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:
> Using the surrogate mechanism, UTF-16 can support all 1,114,112
> potential Unicode characters.
>
> But Unicode doesn't contain 1,114,112 characters -- the surrogates are
> excluded from Unicode, and definitely cannot be encoded using
> UTF-anything.

Surrogates are most certainly part of the Unicode standard, and they are
necessary in UTF-16. (You cannot represent astral characters without
them!) So in a UTF-16 stream, a *pair* of surrogates is nothing unusual:
together they represent a single supplementary-plane code point.
However, *single* surrogates are an error. For example, we see this FAQ:

    Q: How do I convert an unpaired UTF-16 surrogate to UTF-32?

    A: If an unpaired surrogate is encountered when converting
    ill-formed UTF-16 data, any conformant converter must treat this
    as an error. By representing such an unpaired surrogate on its
    own, the resulting UTF-32 data stream would become ill-formed.
    While it faithfully reflects the nature of the input, Unicode
    conformance requires that encoding form conversion always results
    in a valid data stream.

    http://www.unicode.org/faq/utf_bom.html#utf32-7

But nobody says that programming languages must deal only with
conformant converters and valid Unicode sequences. It is an unfortunate
fact of life that even if you don't generate invalid sequences yourself,
somebody else will, so you need to be able to deal with them.

[...]

> We still don't know if the final result will be UCS-4 everywhere (with
> all 2**32 code points allowed?!) or UTF-8 everywhere.

Unicode does not have 2**32 code points. The code space is guaranteed
never to grow beyond the 1,114,112 code points (U+0000 through U+10FFFF,
which fit in 21 bits) already reserved. (Many of those are still
unassigned.)

As far as I am concerned, the future is clear: UTF-8 for transmission
and storage formats, where fast random access is not necessary; UTF-32
for in-memory formats, where O(1) random access is advantageous.
Possibly with certain in-memory optimizations to save space, where such
can be done transparently.
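
As a rough illustration of the points above, here is a small Python 3
sketch. It shows an astral code point turning into a surrogate pair in
UTF-16, a lone surrogate being rejected by the strict codecs, and the
size trade-off between UTF-8 and UTF-32. The helper names
encodes_cleanly and utf32_index are made up for this example, and the
sample text is arbitrary.

```python
import struct

astral = "\U0001F600"  # U+1F600, a code point outside the BMP

# In UTF-16, an astral code point is stored as two 16-bit code units:
# a high surrogate followed by a low surrogate.
units = struct.unpack("<2H", astral.encode("utf-16-le"))
pair = [hex(u) for u in units]

# A Python 3 str can hold a lone surrogate, but the strict codecs
# refuse to encode it.
lone = "\ud83d"


def encodes_cleanly(s, encoding):
    """Hypothetical helper: True if s encodes without error."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The "surrogatepass" error handler lets the lone surrogate through
# anyway, producing bytes that are not valid UTF-8.
smuggled = lone.encode("utf-8", "surrogatepass")

# Size of the code space, and how much of it the surrogates occupy.
code_points = 0x10FFFF + 1
surrogates = 0xDFFF - 0xD800 + 1

# UTF-8 is variable width (1 to 4 bytes per code point); UTF-32 is
# always 4 bytes, so the n-th code point sits at byte offset 4*n.
text = "a\u00e9\u20ac" + astral
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")


def utf32_index(buf, n):
    """Hypothetical helper: O(1) lookup in little-endian UTF-32."""
    return chr(int.from_bytes(buf[4 * n:4 * n + 4], "little"))


print(pair, encodes_cleanly(lone, "utf-8"), smuggled)
print(code_points, surrogates, code_points - surrogates)
print(len(utf8), len(utf32), utf32_index(utf32, 3) == astral)
```

Finding the same code point in the UTF-8 bytes would mean scanning from
the start, which is the random-access cost referred to above.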

In the future, we will no more balk at using four whole bytes for a code
point than we now balk at using eight bytes for floating point numbers.
The mathematical advantages of doubles are just overwhelming, and the
only reason for using fewer than 64 bits is if you care more about
getting a fast answer than an accurate answer.

(I'm reminded of one of my wife's former roadies, back in the 70s,
crossing the US desert in a van. On being told that he was heading in
the wrong direction for their next gig, he replied "Who cares? We're
making great time!")

In the future, we'll have so much memory that the idea of using
variable-width in-memory formats will seem absurd.

-- 
Steven