On Thu, Aug 17, 2017 at 6:27 PM, Chris Angelico <ros...@gmail.com> wrote:
> On Fri, Aug 18, 2017 at 10:14 AM, John Nagle <na...@animats.com> wrote:
>> I'm cleaning up some data which has text description fields from
>> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
>> And some are in some other character set. So I have to examine and
>> sanity check each field in a database dump, deciding which character
>> set best represents what's there.
>>
>> Here's a hard case:
>>
>>   g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
>>
>>   g1.decode("utf8")
>>   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 21:
>>   invalid start byte
>>
>>   g1.decode("windows-1252")
>>   UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 21:
>>   character maps to <undefined>
>>
>> 0x9d is unmapped in "windows-1252", according to
>>
>>   https://en.wikipedia.org/wiki/Windows-1252
>>
>> So the Python codec isn't wrong here.
>>
>> Trying "latin-1"
>>
>>   g1.decode("latin-1")
>>   '\\"Perfect Gift Idea\\"\x9d Each time'
>>
>> That just converts 0x9d in the input to 0x9d in Unicode.
>> That's "Operating System Command" (the "Windows" key?)
>> That's clearly wrong; some kind of quote was intended.
>> Any ideas?
>
> Another possibility is that it's some kind of dash or ellipsis or
> something, but I can't find anything that does. (You already have
> quote characters in there.) The nearest I can actually find is:
>
> >>> b'\\"Perfect Gift Idea\\"\x9d Each time'.decode("1256")
> '\\"Perfect Gift Idea\\"\u200c Each time'
> >>> unicodedata.name("\u200c")
> 'ZERO WIDTH NON-JOINER'
>
> which, honestly, doesn't make a lot of sense either. :(

In CP437 it's ¥, which makes some sense in the "gift idea" context. But then I'd expect a number to appear with it. It could also just be junk data.
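
For anyone who wants to repeat this kind of lookup, here is a rough sketch of how the single byte 0x9d could be tried against the codecs mentioned in this thread. The helper name survey_byte and the candidate list are my own invention for illustration, not anything from the standard library, and it assumes the input is a single byte in a single-byte codec.

```python
import unicodedata

# Codecs discussed in this thread; extend the list as needed.
CANDIDATE_CODECS = ["utf-8", "cp1252", "latin-1", "cp1256", "cp437"]


def survey_byte(raw, codecs=CANDIDATE_CODECS):
    """Hypothetical helper: decode one byte under each codec.

    Returns a dict mapping codec name to (character, Unicode name),
    or to None when the codec refuses to decode the byte.
    """
    results = {}
    for codec in codecs:
        try:
            ch = raw.decode(codec)
        except UnicodeDecodeError:
            # Byte is invalid or unmapped in this codec.
            results[codec] = None
            continue
        # Control characters have no Unicode name, so supply a default
        # instead of letting unicodedata.name() raise ValueError.
        results[codec] = (ch, unicodedata.name(ch, "<unnamed>"))
    return results


if __name__ == "__main__":
    for codec, outcome in survey_byte(b"\x9d").items():
        print(codec, outcome)
```

A codec that decodes the byte is not necessarily the right one, of course. This only narrows down the candidates; whether ¥ or a zero-width non-joiner makes sense still has to be judged from the surrounding text.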