On Fri, Aug 18, 2017 at 10:30 AM, John Nagle <na...@animats.com> wrote: > On 08/17/2017 05:14 PM, John Nagle wrote: >> I'm cleaning up some data which has text description fields from >> multiple sources. > A few more cases: > > bytearray(b'\xe5\x81ukasz zmywaczyk')
This one has to be Polish, and the first character should be the letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is very similar to the E5 81 that you have. So here's an insane theory: something attempted to lower-case the byte stream as if it were ASCII. If you ignore the high bit, 0xC5 looks like 0x45 or "E", which lower-cases by having 32 added to it, yielding 0xE5. Reversing this transformation yields sane data for several of your strings - they then decode as UTF-8: miguel Ángel santos lidija kmetič Łukasz zmywaczyk jiří urbančík Ľubomír mičko petr urbančík That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones are still a puzzle. ChrisA -- https://mail.python.org/mailman/listinfo/python-list