Joris Van den Bossche created ARROW-9078:
--------------------------------------------
Summary: [C++] Parquet writing of extension type with nested
storage type fails
Key: ARROW-9078
URL: https://issues.apache.org/jira/browse/ARROW-9078
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
A reproducer in Python:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
class MyStructType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.struct([('left', pa.int64()),
                                                     ('right', pa.int64())]))
    def __reduce__(self):
        return MyStructType, ()
struct_array = pa.StructArray.from_arrays(
    [
        pa.array([0, 1], type="int64", from_pandas=True),
        pa.array([1, 2], type="int64", from_pandas=True),
    ],
    names=["left", "right"],
)
# works
table = pa.table({'a': struct_array})
pq.write_table(table, "test_struct.parquet")
# doesn't work
mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array)
table = pa.table({'a': mystruct_array})
pq.write_table(table, "test_struct.parquet")
{code}
Writing a plain StructArray to Parquet now works, as does reading it back in.
But when the struct array is the storage array of an ExtensionType, the write
fails with the following error:
{code}
ArrowException: Unknown error: data type leaf_count != builder_leaf_count1 2
{code}