Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:

> For those cases where you do wish to take an arbitrary byte stream and
> round-trip it, Python now provides an error handler for that.
>
> py> import random
> py> b = bytes([random.randint(0, 255) for _ in range(10000)])
> py> s = b.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0:
> invalid start byte
> py> s = b.decode('utf-8', errors='surrogateescape')
> py> s.encode('utf-8', errors='surrogateescape') == b
> True

That is indeed a valid workaround. With it we achieve

   b.decode('utf-8', errors='surrogateescape'). \
       encode('utf-8', errors='surrogateescape') == b

for any bytes b. It goes to great lengths to address the Linux
programmer's situation.

However,

 * it's not UTF-8 but a variant of it,

 * it sacrifices the ordering correspondence of UTF-8:

      >>> '\udc80' > 'ä'
      True
      >>> '\udc80'.encode('utf-8', errors='surrogateescape') > \
      ... 'ä'.encode('utf-8', errors='surrogateescape')
      False

 * it still isn't bijective between str and bytes:

      >>> '\udd00'.encode('utf-8', errors='surrogateescape')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      UnicodeEncodeError: 'utf-8' codec can't encode character '\udd00' in
      position 0: surrogates not allowed


Marko
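
For anyone who wants to rerun the points above as a script instead of at the prompt, here is a rough sketch. The helper names roundtrip() and can_encode() are made up for this illustration and are not part of any library; the only real API used is the 'surrogateescape' error handler of bytes.decode() and str.encode().

```python
import random

HANDLER = "surrogateescape"  # the error handler discussed in this thread


def roundtrip(b: bytes) -> bytes:
    """Hypothetical helper: bytes -> str -> bytes via utf-8 + surrogateescape."""
    return b.decode("utf-8", errors=HANDLER).encode("utf-8", errors=HANDLER)


def can_encode(s: str) -> bool:
    """Hypothetical helper: True if s is encodable with utf-8 + surrogateescape."""
    try:
        s.encode("utf-8", errors=HANDLER)
    except UnicodeEncodeError:
        return False
    return True


# The round trip bytes -> str -> bytes holds for arbitrary input.
data = bytes(random.randint(0, 255) for _ in range(10000))
assert roundtrip(data) == data

# Ordering: the str comparison and the bytes comparison disagree.
lone, umlaut = "\udc80", "\u00e4"  # "\u00e4" is 'ä'
assert lone > umlaut
assert not (
    lone.encode("utf-8", errors=HANDLER) > umlaut.encode("utf-8", errors=HANDLER)
)

# Not bijective: only lone surrogates in U+DC80..U+DCFF act as escapes,
# so a str containing '\udd00' has no bytes image at all.
assert can_encode("\udc80")
assert not can_encode("\udd00")
```

The script raises no AssertionError when the three observations hold.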