* MRAB: > On 2019-03-19 20:32, Florian Weimer wrote: >> I've seen occasional proposals like this one coming up: >> >> | I therefore suggested 1999-11-02 on the unic...@unicode.org mailing >> | list the following approach. Instead of using U+FFFD, simply encode >> | malformed UTF-8 sequences as malformed UTF-16 sequences. Malformed >> | UTF-8 sequences consist excludively of the bytes 0x80 - 0xff, and >> | each of these bytes can be represented using a 16-bit value from the >> | UTF-16 low-half surrogate zone U+DC80 to U+DCFF. Thus, the overlong >> | "K" (U+004B) 0xc1 0x8b from the above example would be represented >> | in UTF-16 as U+DCC1 U+DC8B. If we simply make sure that every UTF-8 >> | encoded surrogate character is also treated like a malformed >> | sequence, then there is no way that a single high-half surrogate >> | could precede the encoded malformed sequence and cause a valid >> | UTF-16 sequence to emerge. >> >> <http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html> >> >> Has this ever been implemented in any Python version? I seem to >> remember something like that, but all I could find was me talking >> about this in 2000.
> Python 3 has "surrogate escape". Have a read of PEP 383 -- Non-decodable > Bytes in System Character Interfaces. Thanks, this is the information I was looking for. -- https://mail.python.org/mailman/listinfo/python-list