[ https://issues.apache.org/jira/browse/ARROW-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rok Mihevc updated ARROW-5419:
------------------------------
    External issue URL: https://github.com/apache/arrow/issues/21872

> [C++] CSV strings_can_be_null option doesn't respect all null_values
> --------------------------------------------------------------------
>
>                 Key: ARROW-5419
>                 URL: https://issues.apache.org/jira/browse/ARROW-5419
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>         Environment: Python 3.6.8
> PyArrow 0.13.1.dev225+g184b8deb
> NumPy 1.16.3
> Pandas 0.24.2
>            Reporter: Dennis Waldron
>            Assignee: Antoine Pitrou
>            Priority: Minor
>              Labels: csv, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Relates to ARROW-5195 and [https://github.com/apache/arrow/issues/4184]
> I was testing the new *strings_can_be_null* ConvertOption (built from git
> 184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) with the CSV reader and noticed
> that, when the option is enabled, a parsed empty string is not returned as
> NULL, even though '' is in the default null_values list
> ([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28])
> {code:java}
> options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
>                        "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
>                        "NULL", "NaN", "n/a", "nan", "null"};
> {code}
> Given that the *strings_can_be_null* option was added to expose the same NULL
> processing for strings as *pandas.read_csv*, I believe that it should also
> treat empty strings as NULL.
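
The behavior the reporter expects can be illustrated with a small pure-Python sketch. This is not Arrow's implementation: `read_string_columns` is a hypothetical helper written only for this illustration, it uses the standard `csv` module, and it leaves every cell as a string (no type inference). It only shows what "respect all null_values" would mean for string columns, using the default list quoted in the report.

```python
import csv
import io

# Default null spellings, copied from the list quoted in the report.
DEFAULT_NULL_VALUES = frozenset({
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
    "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
    "NULL", "NaN", "n/a", "nan", "null",
})


def read_string_columns(data, null_values=DEFAULT_NULL_VALUES):
    """Hypothetical helper: parse CSV bytes into {column: [str or None]}.

    Every cell whose text is in `null_values` becomes None, including
    the empty string when "" is part of the set.
    """
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            columns[name].append(None if cell in null_values else cell)
    return columns


content = b"a,b\n1,null\n2,\n3,test"
result = read_string_columns(content)
print(result)
```

Passing a set without the empty string, for example `null_values={"null"}`, leaves `''` in column `b`, which matches the output described in this report.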
> In Pandas:
> {code:java}
> import io
> import pandas as pd
>
> content = b"a,b\n1,null\n2,\n3,test"
> df = pd.read_csv(io.BytesIO(content))
> print(df)
>    a     b
> 0  1   NaN
> 1  2   NaN
> 2  3  test
> {code}
> In PyArrow:
> {code:java}
> import pyarrow.csv as pc
>
> # Reuses `content` from the Pandas example above.
> convert_options = pc.ConvertOptions(strings_can_be_null=True)
> table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
> print(table.to_pydict())
> OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
> {code}
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)