On Dec 27, 7:37 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Certainly. ISO-2022 is famous for having ambiguous encodings. Try
> these:
>
> unicode("Hallo","iso-2022-jp")
> unicode("\x1b(BHallo","iso-2022-jp")
> unicode("\x1b(JHallo","iso-2022-jp")
> unicode("\x1b(BHal\x1b(Jlo","iso-2022-jp")
>
> or likewise
>
> unicode("[EMAIL PROTECTED]","iso-2022-jp")
> unicode("\x1b$BBB","iso-2022-jp")
>
> In iso-2022-jp-3, there are even more ways to encode the same string.

Wow, it's not easy to see why anyone would ever want that. Is there any logic behind this?

In your samples, both unicode("\x1b(BHallo","iso-2022-jp") and unicode("\x1b(JHallo","iso-2022-jp") give u"Hallo". Does this mean that the ignored/lost bytes in the original strings are not illegal but *represent nothing* in this encoding? I.e., in practice (in a context limited to the encoding in question), should this be considered data loss, or should these strings be considered "equivalent"?

Thanks!

mario
--
http://mail.python.org/mailman/listinfo/python-list
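
One way to probe the question is to decode the four "Hallo" variants from the quoted message and compare the results. The sketch below assumes Python 3 (bytes.decode rather than the Python 2 unicode(...) call used in the thread); the names variants, decoded, all_equal and canonical are made up for illustration. In ISO-2022-JP, ESC ( B designates ASCII and ESC ( J designates JIS X 0201 Roman, so these bytes read as state switches rather than characters, and the letters in "Hallo" map to the same code points in both sets.

```python
# Python 3 sketch; the thread itself uses Python 2's unicode(...).
# The four byte strings are the "Hallo" samples from the quoted message.
variants = [
    b"Hallo",                # no escape sequence: the initial state is ASCII
    b"\x1b(BHallo",          # ESC ( B : explicitly designate ASCII
    b"\x1b(JHallo",          # ESC ( J : designate JIS X 0201 Roman
    b"\x1b(BHal\x1b(Jlo",    # switch character sets in the middle of the word
]

# Decode every variant with the same codec.
decoded = [raw.decode("iso-2022-jp") for raw in variants]

# True if every variant decodes to the same text.
all_equal = all(text == "Hallo" for text in decoded)

# What the encoder itself produces for this text.
canonical = "Hallo".encode("iso-2022-jp")

for raw, text in zip(variants, decoded):
    # A variant "round-trips" only if re-encoding gives back the same bytes.
    print(repr(raw), "->", repr(text), "| round-trips:", text.encode("iso-2022-jp") == raw)
```

If that reading is right, the variants are equivalent as text, and what is lost on decoding is only the choice of byte-level representation: re-encoding would give back the escape-free form, not the original bytes.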