Steve D'Aprano <steve+pyt...@pearwood.info> writes:

> [...]
> Another factor which I didn't see discussed anywhere is that Python
> strings treat surrogates as normal code points. I believe that would
> be troublesome for a UTF-8 implementation:
>
> py> '\uDC37'.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
> position 0: surrogates not allowed
>
> but of course with a UCS-2 or UTF-32 implementation it is trivial: you
> just treat the surrogate as another code point like any other.
Thanks for a very thorough reply, most useful. I'm going to pick you up on the above, though.

Surrogates only exist in UTF-16, where a pair of them encodes one character outside the BMP. They are expressly forbidden in UTF-8 and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC 3629 (2003). There is CESU-8 if you really need a naive encoding of UTF-16 into something UTF-8-alike.

py> low = '\uDC37'

is only meaningful on narrow builds before Python 3.3, where the user must do extra work to handle characters outside the BMP correctly.

--
Pete Forman
https://payg-petef.rhcloud.com
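
For anyone who wants to check the points above, here is a minimal sketch. It assumes CPython 3.4 or later, where the strict UTF-16 and UTF-32 encoders reject lone surrogates just as UTF-8 does. The helper can_encode and the variable names are my own, purely for illustration.

```python
# Assumes CPython 3.4+; all names here are illustrative only.
lone = '\udc37'  # an unpaired low surrogate


def can_encode(text, codec):
    """Return True if text encodes under codec with strict error handling."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Strict encoding of a lone surrogate fails for every UTF codec.
strict_results = {codec: can_encode(lone, codec)
                  for codec in ('utf-8', 'utf-16', 'utf-32')}

# The 'surrogatepass' error handler opts in to the naive three-byte form.
naive = lone.encode('utf-8', 'surrogatepass')

# A character outside the BMP: UTF-16 uses a surrogate pair (D83D DE00),
# while UTF-8 encodes the code point directly in four bytes.
astral = '\U0001F600'
utf16_units = astral.encode('utf-16-be')
utf8_bytes = astral.encode('utf-8')
```

The 'surrogatepass' handler is the explicit opt-in for when you really do need to push a lone surrogate through UTF-8; the result is not well-formed UTF-8 and other decoders are entitled to reject it.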