Dennis Waldron created ARROW-5419:
-------------------------------------

             Summary: [C++] CSV strings_can_be_null option doesn't respect all 
null_values
                 Key: ARROW-5419
                 URL: https://issues.apache.org/jira/browse/ARROW-5419
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
         Environment: Python 3.6.8
PyArrow 0.13.1.dev225+g184b8deb
NumPy 1.16.3
Pandas 0.24.2
            Reporter: Dennis Waldron


Relates to https://issues.apache.org/jira/browse/ARROW-5195 and 
[https://github.com/apache/arrow/issues/4184]

I was testing the new *strings_can_be_null* ConvertOption (built from git 
184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) with the CSV reader and noticed 
that, when the option is enabled, an empty string is not converted to NULL 
even though '' is in the default null_values list 
([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28]):
{code:java}
options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
"-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
"NULL", "NaN", "n/a", "nan", "null"};
{code}
Given that the *strings_can_be_null* option was added to expose the same NULL 
processing for strings as *pandas.read_csv*, I believe that it should also 
convert empty strings to NULL.
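The expected rule can be sketched in plain Python. This is only an illustration of the behaviour requested here, not Arrow's C++ conversion code: the helper name convert_string_cell is hypothetical, and the list is copied from the default null_values quoted above.

```python
# Hypothetical illustration only; this is not Arrow's C++ conversion code.
# The set mirrors the default null_values quoted above.
DEFAULT_NULL_VALUES = frozenset([
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
    "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
    "NULL", "NaN", "n/a", "nan", "null",
])


def convert_string_cell(cell, strings_can_be_null=False,
                        null_values=DEFAULT_NULL_VALUES):
    """Return None for any null spelling when strings_can_be_null is set."""
    if strings_can_be_null and cell in null_values:
        return None
    return cell


# Column "b" of the sample CSV used in this report.
column_b = ["null", "", "test"]
converted = [convert_string_cell(c, strings_can_be_null=True) for c in column_b]
print(converted)  # [None, None, 'test']
```

With strings_can_be_null left at False the cells come back unchanged, so the empty string is only treated as NULL when the option is switched on.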

In Pandas:
{code:java}
import io
import pandas as pd

content = b"a,b\n1,null\n2,\n3,test"
df = pd.read_csv(io.BytesIO(content))
print(df)
   a     b
0  1   NaN
1  2   NaN
2  3  test
{code}
In PyArrow:
{code:java}
import pyarrow.csv as pc

# Reuses io and content from the pandas example above.
convert_options = pc.ConvertOptions(strings_can_be_null=True)
table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
print(table.to_pydict())
OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
{code}
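Until the reader handles this case, one possible stopgap is to post-process the result of to_pydict(). The sketch below is an assumption on my side, not part of PyArrow: nullify_strings is a hypothetical helper, and NULL_VALUES holds only the two spellings needed for this example rather than the full default list.

```python
# Hypothetical post-processing helper; not part of the PyArrow API.
# Only the two null spellings that occur in the sample data are listed here.
NULL_VALUES = frozenset(["", "null"])


def nullify_strings(columns, null_values=NULL_VALUES):
    """Replace string values that spell a null with None, column by column."""
    return {
        name: [None if isinstance(v, str) and v in null_values else v
               for v in values]
        for name, values in columns.items()
    }


# The values reported above for table.to_pydict().
reported = {"a": [1, 2, 3], "b": [None, "", "test"]}
fixed = nullify_strings(reported)
print(fixed)  # {'a': [1, 2, 3], 'b': [None, None, 'test']}
```

This only covers the Python side after conversion; it does not change what the C++ reader produces.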

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
