Karl Dunkle Werner created ARROW-7345:
-----------------------------------------
Summary: [Python] Writing partitions with NaNs silently drops data
Key: ARROW-7345
URL: https://issues.apache.org/jira/browse/ARROW-7345
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Reporter: Karl Dunkle Werner
When writing a partitioned table, if the partitioning column has null (NA/NaN)
values, the rows containing them are silently dropped. I think it would be
helpful if there were a warning. Even better, from my perspective, would be
writing out those rows as their own partition, with a directory name like
{{partition_col=NaN}}.
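To make the suggested layout concrete, here is a minimal pure-Python sketch of grouping rows into Hive-style partition directories while keeping the missing-value group. The helper names ({{partition_dirname}}, {{group_rows_by_partition}}) and the {{NULL_LABEL}} constant are hypothetical and only illustrate the idea; this is not how pyarrow implements partitioning.

```python
import math
from collections import defaultdict

# Hypothetical label for the missing-value partition, taken from the
# suggestion above; pyarrow does not define this constant.
NULL_LABEL = "NaN"


def partition_dirname(column, value):
    """Return a Hive-style directory name, mapping None/NaN to NULL_LABEL."""
    is_missing = value is None or (isinstance(value, float) and math.isnan(value))
    return f"{column}={NULL_LABEL if is_missing else value}"


def group_rows_by_partition(rows, column):
    """Group rows by partition directory without dropping missing keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_dirname(column, row.get(column))].append(row)
    return dict(groups)


rows = [{"a": 1, "b": 2}, {"a": 2, "b": None}]
groups = group_rows_by_partition(rows, "b")
print(sorted(groups))
```

With this grouping both input rows end up in some partition, so the number of rows written matches the number of rows in the table.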
Here's a small example where only the {{b = 2}} group is written out and the
{{b = NaN}} group is dropped.
{code:python}
import os
import tempfile
import pyarrow.json
import pyarrow.parquet
from pathlib import Path
# Create a dataset with NaN:
json_str = """
{"a": 1, "b": 2}
{"a": 2, "b": null}
"""
with tempfile.NamedTemporaryFile() as tf:
    tf = Path(tf.name)
    tf.write_text(json_str)
    table = pyarrow.json.read_json(tf)
# Write out a partitioned dataset, using the NaN-containing column
with tempfile.TemporaryDirectory() as out_dir:
    pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
    print(os.listdir(out_dir))
    read_table = pyarrow.parquet.read_table(out_dir)
print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")
# Output:
#> ['b=2.0']
#> Wrote out 2 rows, read back 1 row
{code}
It looks like this is caused by pandas dropping NaN keys when doing [the {{groupby}}
here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].
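The pandas behaviour can be checked in isolation with a small sketch like the one below. The DataFrame is made up to mirror the example above. The {{dropna=False}} argument is a {{groupby}} option in pandas 1.1 and later, so it is an assumption about the installed pandas version and is shown only for comparison, not as the fix adopted in pyarrow.

```python
import numpy as np
import pandas as pd

# Same shape as the example table: one real key and one missing key in "b".
df = pd.DataFrame({"a": [1, 2], "b": [2.0, np.nan]})

# Default groupby: rows whose key is NaN are left out of every group.
default_sizes = df.groupby("b").size()

# Assumes pandas >= 1.1: dropna=False keeps NaN as its own group.
all_sizes = df.groupby("b", dropna=False).size()

print(f"default groupby covers {default_sizes.sum()} of {len(df)} rows")
print(f"dropna=False covers {all_sizes.sum()} of {len(df)} rows")
```

Any writer that iterates over the default groups only ever sees the {{b = 2}} rows, which is consistent with the single {{b=2.0}} directory in the output above.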