On 2017-08-18 01:14, John Nagle wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252. And some
> are in some other character set. So I have to examine and sanity check
> each field in a database dump, deciding which character set best
> represents what's there.
>
> Here's a hard case:
>
>     g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
>
>     g1.decode("utf8")
>     UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 21: invalid start byte
>
>     g1.decode("windows-1252")
>     UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 21: character maps to <undefined>
>
> 0x9d is unmapped in "windows-1252", according to
>
>     https://en.wikipedia.org/wiki/Windows-1252
>
> So the Python codec isn't wrong here.
>
> Trying "latin-1":
>
>     g1.decode("latin-1")
>     '\\"Perfect Gift Idea\\"\x9d Each time'
>
> That just converts 0x9d in the input to 0x9d in Unicode. That's
> "Operating System Command" (the "Windows" key?) That's clearly wrong;
> some kind of quote was intended. Any ideas?

It's preceded by something in quotes, so it might be ™ (trademark symbol, '\u2122') or something similar. No idea which encoding that would be, though.
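
For the per-field check described in the question, here is a rough sketch of trying a list of candidate codecs in order and flagging results that contain C1 control characters (U+0080 to U+009F), since those usually mean "latin-1" only succeeded by passing stray bytes through. The names CANDIDATES, first_clean_decode and has_c1_controls are made up for illustration and are not from the original post. The last lines show one possible reading of the stray byte, which is only a guess: 0x9d is the final byte of the UTF-8 encoding of U+201D (right double quotation mark), so it could be a leftover from a mangled closing quote.

```python
# Hypothetical helpers, not from the original post.
CANDIDATES = ("utf-8", "windows-1252", "latin-1")


def first_clean_decode(raw, codecs=CANDIDATES):
    """Return (codec_name, text) for the first codec that decodes raw
    without error, or (None, None) if every candidate fails."""
    for name in codecs:
        try:
            return name, bytes(raw).decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


def has_c1_controls(text):
    """True if text contains any C1 control character (U+0080..U+009F)."""
    return any("\x80" <= ch <= "\x9f" for ch in text)


g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
codec, text = first_clean_decode(g1)
print(codec, repr(text))
print("suspicious:", has_c1_controls(text))

# One possible origin of the byte (an assumption, not a diagnosis):
# the UTF-8 encoding of U+201D ends in 0x9d.
right_quote_utf8 = "\u201d".encode("utf-8")
print(right_quote_utf8)
```

Because "latin-1" maps every byte to a code point, it always "succeeds" when listed last; the C1 check is what marks a field like this one for manual review instead of silently accepting it.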
-- https://mail.python.org/mailman/listinfo/python-list
