On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:

> Steven D'Aprano <st...@pearwood.info>:
>
>> On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:
>>
>>> Using the surrogate mechanism, UTF-16 can support all 1,114,112
>>> potential Unicode characters.
>>>
>>> But Unicode doesn't contain 1,114,112 characters - the surrogates are
>>> excluded from Unicode, and definitely cannot be encoded using
>>> UTF-anything.
>>
>> Surrogates are most certainly part of the Unicode standard, and they are
>> necessary in UTF-16.
>
> Yes, but UTF-16 produces 16-bit values that are outside Unicode.

Show me.

Before you answer, if your answer is "surrogate pairs", that is
incorrect. Surrogate pairs are how UTF-16 encodes astral characters. For
example, the UTF-16 sequence of 16-bit *code units* 0xD800 0xDC00 does
not represent "code points U+D800,DC00". It represents the *single* code
point U+10000 "LINEAR B SYLLABLE B008 A".

The code points U+D800 and U+DC00 are reserved for the use of UTF-16 as
surrogates. This means that UTF-16 cannot encode lone surrogates. It
cannot encode, say, the code point U+D800 on its own, because it looks
like half of a surrogate pair for a supplementary-plane code point, which
is an error. And it cannot encode U+D800 immediately followed by U+DC00,
because that would be interpreted as U+10000. So there is a range of code
points, U+D800 through U+DFFF, which cannot be represented in UTF-16.

Where UTF-16 goes, UTF-8 and UTF-32 must follow. It is a requirement of
Unicode that you must be able to freely and losslessly convert between
the three UTFs. (I'm not sure if that also applies to UTF-7.) Since
UTF-16 *cannot* represent this specific range of code points, UTF-8 and
UTF-32 must be *forbidden* from representing it.

Note that the UTF-8 and UTF-32 formats are perfectly capable of
representing lone surrogates. UTF-32, for example, would simply pad the
code point with zeroes: U+D800 would be represented as the four-byte
value 0x0000D800. UTF-8 has a well-defined 3-byte sequence that
corresponds to it (0xED 0xA0 0x80). But that is invalid, since it
violates the requirement that it be freely and losslessly translatable
into UTF-16.

Invalid Unicode strings have their uses, but they are not valid :-)

> UTF-16 can encode *any* valid Unicode, but it cannot encode surrogate
> characters.

Correct. But encoding of surrogates is not required in Unicode. Strictly
speaking, it is forbidden. Did you read the link from the Unicode
Consortium that I provided?

>>> We still don't know if the final result will be UCS-4 everywhere (with
>>> all 2**32 code points allowed?!) or UTF-8 everywhere.
>>
>> Unicode does not have 2**32 code points.
>> It is guaranteed to never exceed the 2**21 code points already
>> allocated. (Many of those are still unused.)
>
> Never say never.

The Unicode Consortium has published this guarantee. It is not going to
change. If somebody wants more than 2**21 code points, they can start
their own new, competing standard.

>> In the future, we'll have so much memory that the idea of using
>> variable width in-memory formats will seem absurd.
>
> I'm starting to think that future is already here.

I'm not *quite* ruling out the possibility that UTF-8 as the internal
representation for in-memory strings is a good idea, but I think that for
non-embedded systems, it is very probably a waste of time.

--
Steven

--
https://mail.python.org/mailman/listinfo/python-list
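
For anyone who wants to check the surrogate claims above at the
interpreter, here is a minimal sketch. It assumes Python 3.4 or later,
where the strict UTF-8, UTF-16 and UTF-32 codecs all refuse lone
surrogates. The helper name to_surrogate_pair is hypothetical and exists
only for illustration; it is not part of the standard library.

```python
# Illustration only: to_surrogate_pair is a hypothetical helper.

def to_surrogate_pair(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into the
    two 16-bit code units that UTF-16 uses to encode it."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

# U+10000 is encoded as the code units D800 DC00.
high, low = to_surrogate_pair(0x10000)

# Going the other way: the code units D800 DC00 (as big-endian bytes)
# decode to a single code point, not to two surrogates.
decoded = b"\xd8\x00\xdc\x00".decode("utf-16-be")

# A lone surrogate is rejected by the strict encoders for all three UTFs.
rejected = []
for codec in ("utf-8", "utf-16", "utf-32"):
    try:
        "\ud800".encode(codec)
    except UnicodeEncodeError:
        rejected.append(codec)

# The 3-byte UTF-8-style pattern for U+D800 is refused when decoding.
try:
    b"\xed\xa0\x80".decode("utf-8")
    lone_decodes = True
except UnicodeDecodeError:
    lone_decodes = False

print(hex(high), hex(low), len(decoded), rejected, lone_decodes)
```

Python can still hold a lone surrogate in a str, and the "surrogatepass"
error handler will encode it if you ask, which is one of the uses of
invalid Unicode strings mentioned above. The default strict handlers
follow the rule that surrogates are not encodable.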