On Fri, Aug 18, 2017 at 10:54 AM, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> On Thu, Aug 17, 2017 at 6:52 PM, Ian Kelly <ian.g.ke...@gmail.com> wrote:
>> On Thu, Aug 17, 2017 at 6:30 PM, John Nagle <na...@animats.com> wrote:
>>> A few more cases:
>>>
>>> bytearray(b'miguel \xe3\x81ngel santos')
>>
>> If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the
>> rest of the name.
>>
>>> bytearray(b'\xe5\x81ukasz zmywaczyk')
>>
>> If that were b'\xc5\x81' it would be Ł in UTF-8 which would fit the
>> rest of the name.
>>
>> I suspect the others contain similar errors. I don't know if it's the
>> result of some form of Mojibake or maybe just transcription errors.
>
> Oh shit, I think I know what happened. In ASCII you can lower-case
> letters by just adding 32 (0x20) to them. Somebody tried to do that
> here and fucked up the encoding. That's why all the ASCII letters in
> the strings are lower-case while these ones aren't.
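For what it's worth, the two quoted cases can be checked mechanically. This is only a rough sketch of the hypothesis, not a general repair: the helper name is made up, and the rule "subtract 0x20 from any byte in 0xE0-0xFF that is followed by a UTF-8 continuation byte" is my assumption. It would corrupt genuine three-byte sequences.

```python
# Observed byte strings quoted earlier in the thread.
observed = [
    b"miguel \xe3\x81ngel santos",
    b"\xe5\x81ukasz zmywaczyk",
]


def subtract_0x20_before_continuation(raw: bytes) -> bytes:
    """Hypothetical helper: undo a stray +0x20 on suspected UTF-8 lead bytes.

    Heuristic (an assumption, not a known algorithm): a byte in 0xE0-0xFF
    directly followed by a continuation byte (0x80-0xBF) is treated as a
    two-byte lead byte that had 0x20 added to it.
    """
    out = bytearray(raw)
    for i in range(len(out) - 1):
        if 0xE0 <= out[i] <= 0xFF and 0x80 <= out[i + 1] <= 0xBF:
            out[i] -= 0x20
    return bytes(out)


# The ASCII trick the hypothesis rests on: 'A' + 0x20 == 'a'.
assert ord("A") + 0x20 == ord("a")

repaired = [subtract_0x20_before_continuation(b).decode("utf-8") for b in observed]
for before, after in zip(observed, repaired):
    print(before, "->", after)
```

Both strings decode cleanly as UTF-8 after the adjustment, which is consistent with the +0x20 idea for these two names.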
That applies to some, but not all.

> bytearray(b'M\x81\x81\xfcnster')

This should be Münster, where the ü is U+00FC. You have 81 81 FC. I
don't know of any encoding that does this, but it looks indicative -
and it's not the lower-casing. And the 0x9d doesn't fit either, but
maybe that has some relation to 0x2d, which is an ASCII hyphen?

ChrisA
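P.S. A quick sketch of how one might check the Münster bytes against a few encodings. The list of candidate codecs is my own arbitrary pick, so an empty result only rules out those, nothing more. The last lines just print how far 0x9d is from 0x2d.

```python
# Observed bytes from the thread; the three bytes after 'M' stand where 'ü' should be.
observed = b"M\x81\x81\xfcnster"
suspect = observed[1:4]  # b'\x81\x81\xfc'

# Arbitrary candidate codecs (an assumption, not an exhaustive list).
candidates = ["utf-8", "latin-1", "cp1252", "cp437", "cp850", "utf-16-le"]

# Encode U+00FC (ü) in each candidate and show the resulting bytes.
encoded = {name: "\u00fc".encode(name) for name in candidates}
for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")

# Which candidates, if any, reproduce the observed three bytes?
matches = [name for name, raw in encoded.items() if raw == suspect]
print("matches:", matches)

# 0x9d versus the ASCII hyphen 0x2d: neither the difference nor the XOR is 0x20.
difference = 0x9D - 0x2D
xor = 0x9D ^ 0x2D
print(hex(difference), hex(xor))
```

None of these produce 81 81 FC on their own, though FC by itself is ü in Latin-1 and 81 by itself is ü in cp437/cp850, which may or may not be a coincidence.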