Marko Rauhamaa wrote:

> Chris Angelico <ros...@gmail.com>:
>
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character.

But it is a valid non-character code point.

>> It is quite correct to throw this error.
>
> '\udd00' is a valid str object:

Is it though? Perhaps the bug is not UTF-8's inability to encode lone
surrogates, but that Python allows you to create lone surrogates in the
first place. That's not a rhetorical question. It's a genuine question.

>    >>> '\udd00'
>    '\udd00'
>    >>> '\udd00'.encode('utf-32')
>    b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>    >>> '\udd00'.encode('utf-16')
>    b'\xff\xfe\x00\xdd'

If you explicitly specify the endianness (say, utf-16-be or -le) then you
don't get the BOMs.

> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.

-- 
Steven
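P.S. For anyone who wants to poke at this, here is a minimal sketch of the two points above. It assumes a recent Python 3, where the default encoders reject lone surrogates. The variable names are mine, and I am guessing that the "surrogateescape" error handler is the kind of U+DC80..U+DCFF scheme Marko has in mind.

```python
import codecs

text = "a"  # any ordinary character; chosen only for illustration

# The generic codec prepends a BOM in the platform's native byte order.
with_bom = text.encode("utf-16")
assert with_bom.startswith(codecs.BOM_UTF16)

# Naming the byte order explicitly produces no BOM.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# By default, a lone surrogate cannot be encoded as UTF-8.
lone = "\udd00"
try:
    lone.encode("utf-8")
    utf8_rejects_lone_surrogate = False
except UnicodeEncodeError:
    utf8_rejects_lone_surrogate = True

# The 'surrogateescape' handler maps each undecodable byte 0x80..0xFF to
# U+DC80..U+DCFF on decoding and back again on encoding, so arbitrary
# bytes survive a decode/encode round trip.
raw = bytes(range(256))
as_text = raw.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

The round trip only holds when the same error handler is used in both directions. Without it, the escaped string fails to encode, which is presumably one of the side effects being alluded to.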