Marko Rauhamaa wrote:

> That said, UTF-8 does suffer badly from its not being
> a bijective mapping.

Can you explain? As far as I am aware, every code point has one and only one valid UTF-8 encoding, and every valid UTF-8 byte sequence decodes to one and only one code point.

There are *invalid* UTF-8 encodings, such as CESU-8, which is sometimes mislabelled as UTF-8 (Oracle, I'm looking at you.) It violates the rule that valid UTF-8 encodings are the shortest possible, and it encodes surrogate code points, which UTF-8 forbids. E.g. SMP code points should be encoded to four bytes using UTF-8:

py> u'\U0010FF01'.encode('utf-8')  # U+10FF01
'\xf4\x8f\xbc\x81'

But in CESU-8, the code point is first interpreted as a UTF-16 surrogate pair:

py> u'\U0010FF01'.encode('utf-16be')
'\xdb\xff\xdf\x01'

then each surrogate of the pair is treated as a 16-bit code unit and individually encoded to three bytes using UTF-8:

py> u'\udbff'.encode('utf-8')
'\xed\xaf\xbf'
py> u'\udf01'.encode('utf-8')
'\xed\xbc\x81'

giving six bytes in total:

'\xed\xaf\xbf\xed\xbc\x81'

This is not UTF-8! But some software mislabels it as UTF-8.


-- 
Steven
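
P.S. The session above is Python 2. On Python 3 the utf-8 codec refuses to encode lone surrogates, so reproducing the six-byte form takes a little more work. A rough sketch, assuming the 'surrogatepass' error handler is available; the helper name cesu8_encode is made up for illustration and is not a standard library function:

```python
def cesu8_encode(text):
    """Encode text as CESU-8 (hypothetical helper, for illustration only)."""
    out = bytearray()
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Supplementary code point: split it into UTF-16 surrogates first.
            units = ch.encode('utf-16-be')
            for i in range(0, len(units), 2):
                surrogate = chr(int.from_bytes(units[i:i + 2], 'big'))
                # 'surrogatepass' makes the utf-8 codec emit the 3-byte form
                # that strict UTF-8 forbids.
                out += surrogate.encode('utf-8', 'surrogatepass')
        else:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode('utf-8')
    return bytes(out)


utf8 = '\U0010FF01'.encode('utf-8')    # four bytes, valid UTF-8
cesu8 = cesu8_encode('\U0010FF01')     # six bytes, not valid UTF-8
print(utf8)
print(cesu8)

# A strict UTF-8 decoder rejects the six-byte form.
try:
    cesu8.decode('utf-8')
except UnicodeDecodeError as exc:
    print('rejected:', exc.reason)
```

The strict decoder refusing the CESU-8 bytes is the point: only the four-byte sequence is accepted for U+10FF01, so the valid mapping stays one-to-one.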