Marko Rauhamaa wrote:

> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:
>
>> Marko Rauhamaa wrote:
>>
>>> That said, UTF-8 does suffer badly from its not being
>>> a bijective mapping.
>>
>> Can you explain?
>
> In Python terms, there are bytes objects b that don't satisfy:
>
>    b.decode('utf-8').encode('utf-8') == b

Are you talking about the fact that not all byte streams are valid UTF-8?
That is, some byte objects b may raise an exception on b.decode('utf-8').

I don't see why that means UTF-8 "suffers badly" from this. Can you give an
example of where you would expect to take an arbitrary byte-stream, decode
it as UTF-8, and expect the results to be meaningful?

For those cases where you do wish to take an arbitrary byte stream and
round-trip it, Python now provides an error handler for that.

py> import random
py> b = bytes([random.randint(0, 255) for _ in range(10000)])
py> s = b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0: invalid start byte
py> s = b.decode('utf-8', errors='surrogateescape')
py> s.encode('utf-8', errors='surrogateescape') == b
True


-- 
Steven
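
For anyone who wants to try the distinction outside the interactive prompt, here is a minimal sketch. The helper names roundtrips_strictly and roundtrip_escaped and the sample byte strings are made up for illustration; only bytes.decode, str.encode, the 'surrogateescape' error handler and os.urandom come from Python itself. It assumes Python 3.

```python
import os


def roundtrips_strictly(b: bytes) -> bool:
    """Hypothetical helper: True if b survives a strict UTF-8 round trip."""
    try:
        return b.decode('utf-8').encode('utf-8') == b
    except UnicodeDecodeError:
        # b is not valid UTF-8, so the strict round trip never gets started.
        return False


def roundtrip_escaped(b: bytes) -> bytes:
    """Hypothetical helper: round-trip b using the surrogateescape handler."""
    # Undecodable bytes are smuggled through as lone surrogates
    # (U+DC80..U+DCFF) and turned back into the original bytes on encode.
    s = b.decode('utf-8', errors='surrogateescape')
    return s.encode('utf-8', errors='surrogateescape')


# Illustrative inputs (assumptions, not taken from the thread).
valid = 'naïve café'.encode('utf-8')   # well-formed UTF-8
invalid = b'\xff\xfe abc \x94'         # contains invalid start bytes
overlong = b'\xc0\x80'                 # overlong form, rejected by the codec
noise = os.urandom(10000)              # arbitrary bytes, as in the session

for name, data in [('valid', valid), ('invalid', invalid),
                   ('overlong', overlong), ('noise', noise)]:
    print(name, roundtrips_strictly(data), roundtrip_escaped(data) == data)
```

The second column is what Marko's expression reports; the third is the same check with the error handler switched on. The resulting str is only good for carrying the bytes around: lone surrogates cannot be encoded again under the default strict handler.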