On 2017-01-13, D'Arcy Cain <da...@vybenetworks.com> wrote: > I thought I was done with this crap once I moved to 3.x but some > Winblows machines are still sending what some circles call "Extended > ASCII". I have a file that I am trying to read and it is barfing on > some characters. For example: > > due to the Qu\xe9bec government > > Obviously should be "due to the Québec government". I can't figure out > what that encoding is or if it is anything that can even be understood > outside of M$.
$ cat decode.py #!/usr/bin/env python3 CODECS = ( "ascii", "big5", "big5hkscs", "cp037", "cp273", "cp424", "cp437", "cp500", "cp720", "cp737", "cp775", "cp850", "cp852", "cp855", "cp856", "cp857", "cp858", "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875", "cp932", "cp949", "cp950", "cp1006", "cp1026", "cp1125", "cp1140", "cp1250", "cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257", "cp1258", "cp65001", "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312", "gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2", "iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", "iso8859_6", "iso8859_7", "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_11", "iso8859_13", "iso8859_14", "iso8859_15", "iso8859_16", "johab", "koi8_r", "koi8_t", "koi8_u", "kz1048", "mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman", "mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004", "shift_jisx0213", "utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7", "utf_8", "utf_8_sig", ) for encoding in CODECS: try: if b"Qu\xe9bec".decode(encoding) == "Québec": print(encoding) except (UnicodeError, LookupError): pass $ ./decode.py cp1250 cp1252 cp1254 cp1256 cp1257 cp1258 latin_1 iso8859_2 iso8859_3 iso8859_4 iso8859_9 iso8859_10 iso8859_13 iso8859_14 iso8859_15 iso8859_16 -- https://mail.python.org/mailman/listinfo/python-list