On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Chris Angelico <ros...@gmail.com>:
>
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character. It is quite
>> correct to throw this error.
>
> '\udd00' is a valid str object:
>
>    >>> '\udd00'
>    '\udd00'
>    >>> '\udd00'.encode('utf-32')
>    b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>    >>> '\udd00'.encode('utf-16')
>    b'\xff\xfe\x00\xdd'
>
> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.
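
For reference, a minimal sketch of the two behaviours under discussion, assuming CPython 3 and its built-in 'surrogateescape' error handler. The helper names encodes_strictly and roundtrip are made up for this illustration and are not part of any library.

```python
def encodes_strictly(text):
    """Return True if text can be encoded as strict UTF-8."""
    try:
        text.encode('utf-8')
        return True
    except UnicodeEncodeError:
        return False


def roundtrip(raw):
    """Decode arbitrary bytes with surrogateescape, then encode them back."""
    # Undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
    text = raw.decode('utf-8', 'surrogateescape')
    return text, text.encode('utf-8', 'surrogateescape')


# A lone surrogate is a legal Python str, but strict UTF-8 refuses it.
print(encodes_strictly('\udd00'))

# Bytes that are not valid UTF-8 survive the trip through str unchanged.
text, back = roundtrip(b'abc\x80\xff')
print(ascii(text))
print(back == b'abc\x80\xff')
```

The round trip only works because the escaped bytes are stored as surrogates that strict UTF-8 would reject, which is one of the side effects mentioned above.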
But '\udd00' is not a valid Unicode string, so a Unicode encoding
can't be expected to cope with it. Mathematically, 0xC0 0x80 would
represent U+0000, and some UTF-8 codecs generate and accept this (in
order to allow U+0000 without ever yielding 0x00), but that doesn't
mean that UTF-8 should allow that byte sequence.

The only reason to craft some kind of Unicode string for an arbitrary
sequence of bytes is the "smuggling" effect used for file name
handling. There is no reason to support invalid Unicode codepoints.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
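
To make the 0xC0 0x80 point concrete, here is a rough sketch, assuming CPython 3's strict 'utf-8' codec. The functions naive_two_byte_decode and strict_decode_ok are hypothetical names invented for this example; the first one deliberately skips the overlong-form check that a conforming decoder performs.

```python
def naive_two_byte_decode(b0, b1):
    """Combine the payload bits of a 110xxxxx 10yyyyyy pair.

    Deliberately omits the overlong check, so it is NOT a valid decoder.
    """
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def strict_decode_ok(raw):
    """Return True if raw is accepted by Python's strict UTF-8 decoder."""
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# Arithmetically, C0 80 carries the payload for U+0000 ...
print(hex(naive_two_byte_decode(0xC0, 0x80)))

# ... but the strict codec rejects it and accepts only the one-byte form.
print(strict_decode_ok(b'\xc0\x80'))
print(strict_decode_ok(b'\x00'))
```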