Steven D'Aprano <st...@pearwood.info>:

> On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:
>
>> Using the surrogate mechanism, UTF-16 can support all 1,114,112
>> potential Unicode characters.
>>
>> But Unicode doesn't contain 1,114,112 characters—the surrogates are
>> excluded from Unicode, and definitely cannot be encoded using
>> UTF-anything.
>
> Surrogates are most certainly part of the Unicode standard, and they are
> necessary in UTF-16.

Yes, but UTF-16 produces 16-bit values that are outside Unicode. UTF-16
can encode *any* valid Unicode, but it cannot encode surrogate
characters.

    >>> '\udc10'.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc10' in position 0: surrogates not allowed
    >>> '\udc10'.encode('utf-16')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-16' codec can't encode character '\udc10' in position 0: surrogates not allowed
    >>> '\udc10'.encode('utf-32')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-32' codec can't encode character '\udc10' in position 0: surrogates not allowed

>> We still don't know if the final result will be UCS-4 everywhere (with
>> all 2**32 code points allowed?!) or UTF-8 everywhere.
>
> Unicode does not have 2**32 code points. It is guaranteed to never
> exceed the 2**21 code points already allocated. (Many of those are
> still unused.)

Never say never.

> In the future, we'll have so much memory that the idea of using
> variable width in-memory formats will seem absurd.

I'm starting to think that future is already here.


Marko
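
To make the surrogate mechanism discussed above concrete, here is a minimal sketch of how a code point above U+FFFF is split into a UTF-16 surrogate pair, and why the count of encodable values is smaller than 1,114,112. The names to_surrogate_pair, is_surrogate, and the three constants are hypothetical, introduced only for illustration; they are not part of Python's codecs.

```python
# Illustrative sketch only: these helper names are assumptions,
# not part of the standard library.

def to_surrogate_pair(cp):
    """Split a supplementary code point (U+10000..U+10FFFF)
    into a (lead, trail) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000              # a 20-bit value
    lead = 0xD800 + (offset >> 10)     # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return lead, trail


def is_surrogate(cp):
    """True if cp lies in the reserved surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


TOTAL_CODE_POINTS = 0x110000          # 1,114,112 code points in all
SURROGATE_COUNT = 0xE000 - 0xD800     # 2,048 of them are surrogates
ENCODABLE = TOTAL_CODE_POINTS - SURROGATE_COUNT  # what UTF-* can carry

if __name__ == "__main__":
    lead, trail = to_surrogate_pair(0x1F600)
    print(hex(lead), hex(trail))
    print(is_surrogate(0xDC10), ENCODABLE)
```

The pair computed for U+1F600 can be compared against the bytes produced by '\U0001F600'.encode('utf-16-be'). A lone value such as U+DC10 falls inside the reserved range, which is consistent with the codecs rejecting it in the tracebacks above.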