On Tue, Jun 28, 2016 at 5:25 PM, Michael Welle <mwe012...@gmx.net> wrote:
> I want to use Python 3 to process data, that unfortunately come with
> different encodings. So far I have found ascii, iso-8859, utf-8,
> windows-1252 and maybe some more in the same file (don't ask...). I read
> the data via sys.stdin and the idea is to read a line, detect the
> current encoding, hit it until it looks like utf-8 and then go on with
> the next line of input:
>
>
> import cchardet
>
> for line in sys.stdin.buffer:
>
>     encoding = cchardet.detect(line)['encoding']
>     line = line.decode(encoding, 'ignore')\
>         .encode('UTF-8').decode('UTF-8', 'ignore')
>
>
> After that line should be a string. The logging module and some others
> choke on line: UnicodeEncodeError: 'charmap' codec can't encode
> character. What would be a right approach to tackle that problem
> (assuming that I can't change the input data)?

This is the exact sort of "ewwww" that I have to cope with in my MUD
client. Sometimes it gets sent UTF-8, other times it gets sent...
uhhhh... some eight-bit encoding, most likely either 8859 or 1252 (but
could theoretically be anything). The way I cope with it is to do a
line-by-line decode, similar to what you're doing, but with a much
simpler algorithm - something like this:

for line in <binary source>:
    try:
        line = line.decode("UTF-8")
    except UnicodeDecodeError:
        line = line.decode("1252")
    yield line

There's no need to chardet for UTF-8; if you successfully decode the
text, it's almost certainly correct. (This includes pure ASCII text,
which would also decode successfully and correctly as ISO-8859 or
Windows-1252.)

You shouldn't need this complicated triple-encode dance. Just decode it
once and work with text from there on. Ideally, you should be using
Python 3, where "work[ing] with text" is exactly how most of the code
wants to work; if not, resign yourself to reprs with u-prefixes, and
work with Unicode strings anyway. It'll save you a lot of trouble.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
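
For anyone who wants to run the fallback loop described above, here is a minimal sketch of it as a generator. The function name `decode_lines`, the `fallback` parameter, and the sample bytes are made up for illustration and are not from the original post. It also departs from the original snippet in one respect: the fallback decode uses errors="replace", on the assumption that you would rather get a U+FFFD marker than an exception for the few bytes Windows-1252 leaves undefined (such as 0x81).

```python
import io


def decode_lines(binary_source, fallback="cp1252"):
    """Yield each line of a binary stream as str.

    UTF-8 is tried first; a line that is not valid UTF-8 is decoded
    with the eight-bit fallback codec instead.
    """
    for raw in binary_source:
        try:
            yield raw.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: replace undefined bytes rather than raising,
            # since cp1252 has no mapping for a handful of byte values.
            yield raw.decode(fallback, errors="replace")


# Hypothetical mixed input: one UTF-8 line, one Windows-1252 line,
# and one pure-ASCII line.
sample = io.BytesIO("caf\u00e9\n".encode("utf-8") + b"caf\xe9\n" + b"plain\n")
decoded = list(decode_lines(sample))
```

With real input, the same function should accept `sys.stdin.buffer` in place of the `BytesIO` object, since both yield bytes lines when iterated.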