New submission from Christoph Rauch:

I have uncovered a strange behavior in io.TextIOWrapper which I think is a bug.
#!/usr/bin/env python
# encoding: utf-8

import csv
import io

raw_file = io.FileIO('utf-8-encoded.csv', 'rb')
stream = io.BufferedReader(raw_file)
stream = io.TextIOWrapper(stream, encoding="UTF-8")
reader = csv.reader(stream, delimiter=";")

cells = 0
for row in reader:
    # Cells should contain 4 Unicode characters.
    assert all([len(cell.decode('utf-8')) == 4 for cell in row]), row
    cells += len(row)

assert cells == 210, cells

This produces a not very useful:

Traceback (most recent call last):
  File "utf8-textio-test.py", line 15, in <module>
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

The only way to let it *not* crash is to set encoding to ascii and errors to ignore, but this clears out all the characters with ord>128, clearly not useful as well, so I hope this behavior is not intended.

I appended a file with which to test this problem.

----------
components: IO
files: utf-8-encoded.csv
messages: 181028
nosy: Christoph.Rauch
priority: normal
severity: normal
status: open
title: io.TextIOWrapper does not handle UTF-8 encoded streams correctly
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file28922/utf-8-encoded.csv

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue17090>
_______________________________________
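For comparison, here is a rough sketch of the check the script above is trying to perform, written for Python 3, where csv.reader accepts text streams directly. It is only an illustration under stated assumptions: the SAMPLE bytes and the count_cells helper are hypothetical stand-ins (the attached utf-8-encoded.csv is not reproduced here), and the sketch says nothing about how Python 2.7 should behave.

```python
import csv
import io

# Hypothetical stand-in for the attached utf-8-encoded.csv:
# two rows of three cells, each cell holding four non-ASCII characters.
SAMPLE = u"αβγδ;абвг;日本語文\nßøæð;€£¥¢;←↑→↓\n".encode("utf-8")


def count_cells(binary_stream):
    """Decode a UTF-8 byte stream and count its ';'-separated cells."""
    # newline="" leaves line-ending handling to the csv module.
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")
    cells = 0
    for row in csv.reader(text, delimiter=";"):
        # On Python 3 every cell is already a str, so no .decode() call.
        assert all(len(cell) == 4 for cell in row), row
        cells += len(row)
    return cells


print(count_cells(io.BytesIO(SAMPLE)))
```

The in-memory io.BytesIO takes the place of the io.FileIO / io.BufferedReader pair so the sketch runs without the attachment; with a real file, open('utf-8-encoded.csv', 'rb') could be passed to count_cells instead.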