Re: Newbie question about text encoding

Steven D'Aprano Sat, 07 Mar 2015 07:47:26 -0800

Marko Rauhamaa wrote:

> That said, UTF-8 does suffer badly from its not being
> a bijective mapping.


Can you explain?

As far as I am aware, every code point has one and only one valid UTF-8
encoding, and every valid UTF-8 byte sequence decodes to one and only one
code point.
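
A quick way to check that claim is a round trip over every encodable code
point. The following is only a sketch in Python 3 (not the Python 2 used in
the sessions below); the helper names roundtrips and decodes are made up for
illustration. It also shows that an overlong two-byte spelling of '/' is
rejected, so it cannot act as a second encoding of U+002F.

```python
def roundtrips(cp):
    """True if code point cp survives encode followed by decode."""
    s = chr(cp)
    return s.encode('utf-8').decode('utf-8') == s

# Surrogates (U+D800..U+DFFF) are excluded: strict UTF-8 cannot encode them.
scalars = [cp for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF]
all_ok = all(roundtrips(cp) for cp in scalars)

def decodes(data):
    """True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# b'\xc0\xaf' is an overlong (non-shortest) form of '/' and must be refused.
overlong_rejected = not decodes(b'\xc0\xaf')
```

If both all_ok and overlong_rejected come out true, the mapping between
code points and valid byte sequences is one-to-one in both directions, at
least as far as Python's codec is concerned.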

There are encodings that resemble UTF-8 but are *not* valid UTF-8, such as
CESU-8, which is sometimes mislabelled as UTF-8 (Oracle, I'm looking at you.)
It violates the rule that UTF-8 never encodes surrogates, and so it produces
a longer byte sequence than the one valid UTF-8 form.

E.g. supplementary (non-BMP) code points should be encoded to four bytes
using UTF-8:

py> u'\U0010FF01'.encode('utf-8')  # U+10FF01
'\xf4\x8f\xbc\x81'


But in CESU-8, the code point is first interpreted as a UTF-16 surrogate
pair:

py> u'\U0010FF01'.encode('utf-16be')
'\xdb\xff\xdf\x01'


then each surrogate of the pair is treated as a 16-bit code unit and
individually encoded to three bytes using UTF-8:

py> u'\udbff'.encode('utf-8')
'\xed\xaf\xbf'
py> u'\udf01'.encode('utf-8')
'\xed\xbc\x81'


giving six bytes in total:

'\xed\xaf\xbf\xed\xbc\x81'


This is not UTF-8! But some software mislabels it as UTF-8.
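
The steps above can be put together as a small encoder. This is a sketch
only, written for Python 3, where the strict UTF-8 codec refuses lone
surrogates and the 'surrogatepass' error handler is needed to reproduce the
Python 2 behaviour shown above. The names cesu8_encode and is_valid_utf8 are
hypothetical helpers, not part of any library.

```python
import struct

def cesu8_encode(text):
    """Encode text as CESU-8: UTF-8 applied to each UTF-16 code unit."""
    out = bytearray()
    utf16 = text.encode('utf-16-be')
    for (unit,) in struct.iter_unpack('>H', utf16):
        # 'surrogatepass' lets Python 3 encode a lone surrogate, which the
        # strict UTF-8 codec would otherwise reject.
        out += chr(unit).encode('utf-8', 'surrogatepass')
    return bytes(out)

def is_valid_utf8(data):
    """True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

cesu = cesu8_encode('\U0010FF01')      # six bytes
utf8 = '\U0010FF01'.encode('utf-8')    # four bytes
```

Here is_valid_utf8(utf8) is true and is_valid_utf8(cesu) is false: a strict
decoder rejects the six-byte form, so it is not a second valid encoding of
the same code point.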


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list
