Dennis Waldron created ARROW-5419:
-------------------------------------

             Summary: [C++] CSV strings_can_be_null option doesn't respect all null_values
                 Key: ARROW-5419
                 URL: https://issues.apache.org/jira/browse/ARROW-5419
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
         Environment: Python 3.6.8, PyArrow 0.13.1.dev225+g184b8deb, NumPy 1.16.3, Pandas 0.24.2
            Reporter: Dennis Waldron

Relates to https://issues.apache.org/jira/browse/ARROW-5195 and [https://github.com/apache/arrow/issues/4184]

I was testing the new *strings_can_be_null* ConvertOption (built from git 184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) with the CSV reader and noticed that, when the option is enabled, a parsed empty string is not returned as NULL, even though '' is in the default null_values list ([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28]):

{code:java}
options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
                       "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
                       "NULL", "NaN", "n/a", "nan", "null"};
{code}

Given that the *strings_can_be_null* option was added to expose the same NULL handling for strings as *pandas.read_csv*, I believe it should also treat empty strings as NULL.

In Pandas:

{code:java}
import io
import pandas as pd

content = b"a,b\n1,null\n2,\n3,test"
df = pd.read_csv(io.BytesIO(content))
print(df)
#    a     b
# 0  1   NaN
# 1  2   NaN
# 2  3  test
{code}

In PyArrow:

{code:java}
import pyarrow.csv as pc

# Reuses `io` and `content` from the Pandas example above.
convert_options = pc.ConvertOptions(strings_can_be_null=True)
table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
print(table.to_pydict())
# OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
{code}

Only the literal "null" in column b becomes None; the empty string in the second row is kept as ''.
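
To make the expected semantics concrete, here is a minimal pure-Python sketch using only the standard library. It is not Arrow's implementation: the helper name `read_string_columns` is hypothetical, it does no type inference (every cell stays a string), and it assumes the null spellings are exactly the default list quoted in this report. It only illustrates that every entry in null_values, including '', would map to NULL for string columns.

```python
import csv
import io

# Default null spellings as quoted in this report (cpp/src/arrow/csv/options.cc).
NULL_VALUES = {
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null",
}


def read_string_columns(data, null_values=NULL_VALUES):
    """Hypothetical helper: read CSV bytes into {column: [cells]}.

    Every cell is kept as a string, except that any spelling found in
    null_values (the empty string included) is replaced by None.
    """
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            columns[name].append(None if cell in null_values else cell)
    return columns


content = b"a,b\n1,null\n2,\n3,test"
result = read_string_columns(content)
print(result["b"])  # [None, None, 'test']
```

With the same input as in the report, both "null" and the empty cell in column b come back as None, which matches the Pandas result rather than the current PyArrow one.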