> Thanks a lot Martin and Marc for the really great explanations! I was
> wondering if it would be reasonable to imagine a utility that will
> determine whether, for a given encoding, two byte strings would be
> equivalent.

But that is much easier to answer:

    s1.decode(enc) == s2.decode(enc)

Assuming Unicode's unification, for a single encoding, this should
produce correct results in all cases I'm aware of.

If you also have different encodings, you should add

    import unicodedata

    def normal_decode(s, enc):
        return unicodedata.normalize("NFKD", s.decode(enc))

    normal_decode(s1, enc1) == normal_decode(s2, enc2)

This would flatten out compatibility characters, and ambiguities
left in Unicode itself (such as precomposed versus combining forms
of the same character).

> But I think such a utility will require *extensive*
> knowledge about many bizarrities of many encodings -- and has little
> chance of being pretty!

See above.

> In any case, it goes well beyond the situation that triggered my
> original question in the first place, that basically was to provide a
> reasonable check on whether round-tripping a string is successful --
> this is in the context of a small utility to guess an encoding and to
> use it to decode a byte string. This utility module was triggered by
> one that Skip Montanaro had written some time ago, but I wanted to add
> and combine several ideas and techniques (and support for my usage
> scenarios) for guessing a string's encoding in one convenient place.

Notice that this algorithm is not capable of detecting the ISO-2022
encodings - they look like ASCII to this algorithm. This is by design,
as these encodings were designed to use only 7-bit bytes, so that
you can safely transport them in email and such (*).

If you want to add support for ISO-2022, you should look for escape
characters, and then check whether the escape sequences are among
the ISO-2022 ones:

- ESC (  - 94-character graphic character set, G0
- ESC )  - 94-character graphic character set, G1
- ESC *  - 94-character graphic character set, G2
- ESC +  - 94-character graphic character set, G3
- ESC -  - 96-character graphic character set, G1
- ESC .  - 96-character graphic character set, G2
- ESC /  - 96-character graphic character set, G3
- ESC $  - Multibyte
           (  G0
           )  G1
           *  G2
           +  G3
- ESC %  - Non-ISO-2022 (e.g. UTF-8)

If you see any of these, it should be ISO-2022; see the Wiki page
as to what subset may be in use. G0..G3 means what register the
character set is loaded into; when you have loaded a character set
into a register, you can switch between registers through ^N (to G1),
^O (to G0), ESC n (to G2), ESC o (to G3) (*).

> http://gizmojo.org/code/decodeh/
>
> I will be very interested in any remarks any of you may have!

From a shallow inspection, it looks right. I would have spelled
"losses" as "loses".

Regards,
Martin

(*) For completeness: ISO-2022 also supports 8-bit characters, and
there are more control codes to shift between the various registers.

-- 
http://mail.python.org/mailman/listinfo/python-list
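
P.S. For reference, here is a rough sketch of the two suggestions from this message: the normalized comparison and the check for ISO-2022 escape sequences. It assumes Python 3 (bytes in, str out), and the names equivalent, looks_like_iso2022 and ISO2022_INTRODUCERS are made up for illustration; they are not part of decodeh. The escape check only looks at the byte following ESC, so treat it as a heuristic, not a parser.

```python
import unicodedata

ESC = 0x1B

# Bytes that may follow ESC in the sequences listed in this message:
# ( ) * + - . / $ %   (hypothetical helper constant)
ISO2022_INTRODUCERS = frozenset(b"()*+-./$%")


def normal_decode(data, enc):
    """Decode bytes, then flatten compatibility characters via NFKD."""
    return unicodedata.normalize("NFKD", data.decode(enc))


def equivalent(s1, enc1, s2, enc2=None):
    """Return True if both byte strings decode to the same normalized text.

    If enc2 is omitted, both strings are assumed to use enc1.
    """
    if enc2 is None:
        enc2 = enc1
    return normal_decode(s1, enc1) == normal_decode(s2, enc2)


def looks_like_iso2022(data):
    """Heuristic: does data contain ESC followed by a listed byte?"""
    for i in range(len(data) - 1):
        if data[i] == ESC and data[i + 1] in ISO2022_INTRODUCERS:
            return True
    return False
```

A guesser could call looks_like_iso2022(data) before falling back to ASCII, and use equivalent(original, enc, roundtripped, enc) as the round-trip check. Other escape sequences, such as ANSI terminal codes starting with ESC [, are deliberately not matched.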