[ 
https://issues.apache.org/jira/browse/ARROW-4883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-4883:
------------------------------
    External issue URL: https://github.com/apache/arrow/issues/21393

> [Python] read_csv() returns garbage if given file object in text mode
> ---------------------------------------------------------------------
>
>                 Key: ARROW-4883
>                 URL: https://issues.apache.org/jira/browse/ARROW-4883
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.1
>         Environment: Python: 3.7.2, 2.7.15
> PyArrow: 0.12.1
> OS: MacOS 10.13.6 (High Sierra)
>            Reporter: Diego Argueta
>            Priority: Major
>              Labels: csv
>             Fix For: 0.15.0
>
>
> h1. Summary:
> Python 3:
> * {{read_csv}} returns mojibake if given file objects opened in text mode. It 
> behaves as expected in binary mode.
> * Files encoded in anything other than valid UTF-8 will cause a crash.
> Python 2:
> * {{read_csv}} only handles ASCII files. If given a file in UTF-8 with 
> characters above U+007F, it crashes.
> h1. To reproduce:
> 1) Create a CSV like this
> {code}
> Header
> 123.45
> {code}
> 2) Then run this code on Python 3:
> {code:python}
> >>> import pyarrow.csv as pa_csv
> >>> pa_csv.read_csv(open('test.csv', 'r'))
> pyarrow.Table
> 䧢: string
> {code}
> Notice that the file object is opened in text mode. Passing an explicit 
> encoding doesn't help:
> {code:python}
> >>> pa_csv.read_csv(open('test.csv', 'r', encoding='utf-8'))
> pyarrow.Table
> 䧢: string
> >>> pa_csv.read_csv(open('test.csv', 'r', encoding='ascii'))
> pyarrow.Table
> 䧢: string
> >>> pa_csv.read_csv(open('test.csv', 'r', encoding='iso-8859-1'))
> pyarrow.Table
> 䧢: string
> {code}
> If I open the file in binary mode it works:
> {code:python}
> >>> pa_csv.read_csv(open('test.csv', 'rb'))
> pyarrow.Table
> Header: double
> {code}
> I tried this with a file encoded in UTF-16, and displaying the resulting 
> table raised a {{UnicodeDecodeError}}:
> {code}
> Traceback (most recent call last):
>   File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 84, in _process_text
>     self._execute(line)
>   File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 139, in _execute
>     result_str = '%s\n' % repr(result).decode('utf-8')
>   File "pyarrow/table.pxi", line 960, in pyarrow.lib.Table.__repr__
>   File "pyarrow/types.pxi", line 903, in pyarrow.lib.Schema.__str__
>   File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/compat.py", line 143, in frombytes
>     return o.decode('utf8')
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
> 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
> {code}
> Presumably this is because the code always assumes the file is in UTF-8.
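
The "byte 0xff in position 0" in the traceback above is consistent with a UTF-16 byte order mark reaching a UTF-8 decoder. A minimal stdlib-only sketch of that failure, assuming the test file was little-endian UTF-16 with a BOM (the report does not say which UTF-16 variant was used):

```python
import codecs

# Assumed file contents: the sample CSV from the report, stored as
# UTF-16-LE with a leading byte order mark (bytes FF FE).
data = codecs.BOM_UTF16_LE + "Header\n123.45\n".encode("utf-16-le")

try:
    # Decoding the raw bytes as UTF-8 fails on the very first byte of the BOM.
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    failure = (exc.start, exc.object[exc.start], exc.reason)
else:
    failure = None

print(failure)
```

This only reproduces the decoding error in isolation; it does not show where inside pyarrow the UTF-8 assumption is made.
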
> h2. Python 2 behavior
> Python 2 behaves differently: it uses the ASCII codec by default, so when 
> handed a file encoded in UTF-8, {{read_csv}} returns without an error. The 
> failure only appears when you try to access the table:
> {code}
> >>> t = pa_csv.read_csv(open('/Users/diegoargueta/Desktop/test.csv', 'r'))
> >>> list(t)
> Traceback (most recent call last):
>   File "/<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 84, in _process_text
>     self._execute(line)
>   File "<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 139, in _execute
>     result_str = '%s\n' % repr(result).decode('utf-8')
>   File "pyarrow/table.pxi", line 387, in pyarrow.lib.Column.__repr__
>     result.write('\n{}'.format(str(self.data)))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
> 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
> {code}
> h1. Expectation
> We should be able to hand read_csv() a file in text mode so that the CSV file 
> can be in any text encoding. 
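
Until text-mode input is handled, one caller-side way to approximate the expectation above is to let Python decode the file and pass UTF-8 bytes on. The sketch below is stdlib-only; the helper name `as_utf8_binary` is hypothetical and not part of pyarrow, and the final `read_csv` call is left as a comment because this block does not import pyarrow.

```python
import io
import os
import tempfile


def as_utf8_binary(f):
    """Hypothetical helper: return a binary file-like object holding UTF-8 bytes.

    Text-mode files are read through their own decoder (f.encoding) and
    re-encoded as UTF-8; binary-mode files are returned unchanged.
    Note: this reads the whole file into memory.
    """
    if isinstance(f, io.TextIOBase):
        return io.BytesIO(f.read().encode("utf-8"))
    return f


# Demo: write the sample CSV from the report as UTF-16, then reopen it in text mode.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(path, "w", encoding="utf-16", newline="") as out:
    out.write("Header\n123.45\n")

with open(path, "r", encoding="utf-16", newline="") as src:
    utf8_bytes = as_utf8_binary(src).getvalue()

print(utf8_bytes)
# The bytes could then be handed to the reader in binary form, e.g.
# pa_csv.read_csv(io.BytesIO(utf8_bytes))
```

The trade-off is an extra in-memory copy of the file, so this is only a stopgap for inputs that fit comfortably in memory.
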



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
