Re: What extended ASCII character set uses 0x9D?

MRAB Thu, 17 Aug 2017 19:19:44 -0700

On 2017-08-18 01:53, Chris Angelico wrote:

On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:

On 08/17/2017 05:14 PM, John Nagle wrote:

     I'm cleaning up some data which has text description fields from
multiple sources.

A few more cases:


bytearray(b'\xe5\x81ukasz zmywaczyk')


This one has to be Polish, and the first character should be the
letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
very similar to the E5 81 that you have.

So here's an insane theory: something attempted to lower-case the byte
stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
like 0x45 or "E", which lower-cases by having 32 added to it, yielding
0xE5. Reversing this transformation yields sane data for several of
your strings - they then decode as UTF-8:

miguel Ángel santos


I think that's:

miguel ángel santos

lidija kmetič
Łukasz zmywaczyk
jiří urbančík
Ľubomír mičko
petr urbančík

That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones
are still a puzzle.
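
For anyone who wants to try the theory on their own data, here is a minimal sketch of it in Python 3. The function names are made up for illustration, and the "buggy lowercase" model is only an assumption about what the unknown upstream tool did: it treats the low 7 bits of every byte as ASCII and adds 32 when they look like 'A'-'Z'. The reversal only touches bytes 0xE1-0xFA, since the original case of plain ASCII letters cannot be recovered.

```python
def ascii_lower_ignoring_high_bit(data: bytes) -> bytes:
    """Hypothetical model of the suspected bug: look only at the low
    7 bits of each byte and, if they fall in 'A'..'Z', add 32."""
    return bytes(
        b | 0x20 if 0x41 <= (b & 0x7F) <= 0x5A else b
        for b in data
    )


def undo_high_bit_lower(data: bytes) -> bytes:
    """Reverse the transformation for non-ASCII bytes only:
    0xE1..0xFA go back to 0xC1..0xDA. ASCII letters are left alone."""
    return bytes(
        b - 0x20 if b >= 0x80 and 0x61 <= (b & 0x7F) <= 0x7A else b
        for b in data
    )


# The sample from earlier in the thread.
sample = bytes(bytearray(b'\xe5\x81ukasz zmywaczyk'))

repaired = undo_high_bit_lower(sample)
print(repaired.decode('utf-8'))  # Łukasz zmywaczyk

# Sanity check: applying the modelled bug to the repaired text
# reproduces the damaged bytes.
assert ascii_lower_ignoring_high_bit(repaired) == sample
```

Note that 0xE1-0xEF are also legitimate lead bytes of three-byte UTF-8 sequences, so this reversal would corrupt data that was never damaged in this way. It is probably safest to apply it only to strings that fail to decode as UTF-8 as they stand.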

--
https://mail.python.org/mailman/listinfo/python-list
