[ https://issues.apache.org/jira/browse/ARROW-4883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rok Mihevc updated ARROW-4883:
------------------------------
    External issue URL: https://github.com/apache/arrow/issues/21393

> [Python] read_csv() returns garbage if given file object in text mode
> ---------------------------------------------------------------------
>
> Key: ARROW-4883
> URL: https://issues.apache.org/jira/browse/ARROW-4883
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.12.1
> Environment: Python: 3.7.2, 2.7.15
> PyArrow: 0.12.1
> OS: MacOS 10.13.6 (High Sierra)
> Reporter: Diego Argueta
> Priority: Major
> Labels: csv
> Fix For: 0.15.0
>
>
> h1. Summary:
> Python 3:
> * {{read_csv}} returns mojibake if given file objects opened in text mode. It
> behaves as expected in binary mode.
> * Files encoded in anything other than valid UTF-8 will cause a crash.
> Python 2:
> {{read_csv}} only handles ASCII files. If given a file in UTF-8 with
> characters over U+007F, it crashes.
> h1. To reproduce:
> 1) Create a CSV like this
> {code}
> Header
> 123.45
> {code}
> 2) Then run this code on Python 3:
> {code:python}
> >>> import pyarrow.csv as pa_csv
> >>> pa_csv.read_csv(open('test.csv', 'r'))
> pyarrow.Table
> 䧢: string
> {code}
> Notice the file descriptor is open in text mode.
> Changing the encoding doesn't help:
> {code:python}
> >>> pa_csv.read_csv(open('test.csv', 'r', encoding='utf-8'))
> pyarrow.Table
> 䧢: string
> >>> pa_csv.read_csv(open('test.csv', 'r', encoding='ascii'))
> pyarrow.Table
> 䧢: string
> >>> pa_csv.read_csv(open('test.csv', 'r', encoding='iso-8859-1'))
> pyarrow.Table
> 䧢: string
> {code}
> If I open the file in binary mode it works:
> {code:python}
> >>> pa_csv.read_csv(open('test.csv', 'rb'))
> pyarrow.Table
> Header: double
> {code}
> I tried this with a file encoded in UTF-16 and it freaked out:
> {code}
> Traceback (most recent call last):
>   File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 84, in _process_text
>     self._execute(line)
>   File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 139, in _execute
>     result_str = '%s\n' % repr(result).decode('utf-8')
>   File "pyarrow/table.pxi", line 960, in pyarrow.lib.Table.__repr__
>   File "pyarrow/types.pxi", line 903, in pyarrow.lib.Schema.__str__
>   File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/compat.py", line 143, in frombytes
>     return o.decode('utf8')
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
> 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
> {code}
> Presumably this is because the code always assumes the file is in UTF-8.
> h2. Python 2 behavior
> Python 2 behaves differently -- it uses the ASCII codec by default, so when
> handed a file encoded in UTF-8, it will return without an error. Try to
> access the table...
> {code}
> >>> t = pa_csv.read_csv(open('/Users/diegoargueta/Desktop/test.csv', 'r'))
> >>> list(t)
> Traceback (most recent call last):
>   File "/<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 84, in _process_text
>     self._execute(line)
>   File "<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 139, in _execute
>     result_str = '%s\n' % repr(result).decode('utf-8')
>   File "pyarrow/table.pxi", line 387, in pyarrow.lib.Column.__repr__
>     result.write('\n{}'.format(str(self.data)))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
> 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
> {code}
> h1. Expectation
> We should be able to hand read_csv() a file in text mode so that the CSV file
> can be in any text encoding.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
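
A possible stopgap for the Python 3 case in the report, offered as a sketch and not as the fix that shipped in 0.15.0: decode the file in Python using its real encoding, then hand read_csv() a binary buffer of UTF-8 bytes, since the report shows that binary-mode input parses correctly. The helper name `as_utf8_binary` is hypothetical, and the assumption that read_csv() accepts an in-memory binary buffer the same way it accepts a file opened with 'rb' should be checked against the installed PyArrow version.

```python
import io


def as_utf8_binary(path, encoding):
    """Return a binary buffer holding the contents of `path` re-encoded as UTF-8.

    Hypothetical helper for illustration only; `encoding` must be the
    encoding the file was actually written in (e.g. 'utf-16').
    """
    # newline="" keeps the original line endings untouched while decoding.
    with open(path, "r", encoding=encoding, newline="") as text_file:
        text = text_file.read()
    # Re-encode to UTF-8, the encoding the report says read_csv() assumes.
    return io.BytesIO(text.encode("utf-8"))
```

The returned buffer would then be passed along the lines of `pa_csv.read_csv(as_utf8_binary('test.csv', 'utf-16'))`. The whole file is held in memory twice (as text and as bytes), so this only suits inputs that fit comfortably in RAM.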