Alexey Strokach created ARROW-1974:
--------------------------------------
Summary: PyArrow segfaults when working with Arrow tables with
duplicate columns
Key: ARROW-1974
URL: https://issues.apache.org/jira/browse/ARROW-1974
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 0.8.0
Environment: Linux Mint 18.2
Anaconda Python distribution + pyarrow installed from the conda-forge channel
Reporter: Alexey Strokach
Priority: Minor
I accidentally created a large number of Parquet files with two
__index_level_0__ columns (through a Spark SQL query).
PyArrow can read these files into tables, but it segfaults when converting the
resulting tables to Pandas DataFrames or when saving the tables to Parquet
files.
{code:python}
import pyarrow.parquet as pq

# Duplicate columns cause segmentation faults
table = pq.read_table('/path/to/duplicate_column_file.parquet')
table.to_pandas()  # Segmentation fault
pq.write_table(table, '/some/output.parquet')  # Segmentation fault
{code}
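The files in question came out of Spark, but a table with duplicate column names can also be constructed directly in PyArrow. I would expect this to serve as a standalone reproduction, although I have only tested against the Spark-generated files (the output file name below is a placeholder):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Build a table with two columns that share the name '__index_level_0__'
arrays = [pa.array([0, 1, 2]), pa.array([3, 4, 5])]
table = pa.Table.from_arrays(arrays,
                             names=['__index_level_0__', '__index_level_0__'])

# Writing the table should hit the same code path as the failing files
pq.write_table(table, 'duplicate_column_file.parquet')  # expected: segfault
{code}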
If I remove the duplicate column using table.remove_column(...), everything
works without segfaults.
{code:python}
import pyarrow.parquet as pq

# After removing the duplicate column, everything works fine
table = pq.read_table('/path/to/duplicate_column_file.parquet')
# remove_column returns a new table; 34 is the index of the duplicate
# __index_level_0__ column in this particular file
table = table.remove_column(34)
table.to_pandas()  # OK
pq.write_table(table, '/some/output.parquet')  # OK
{code}
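The hard-coded index 34 is specific to my files; as a sketch, the duplicate could instead be located by name (assuming the second occurrence of __index_level_0__ is the one to drop):
{code:python}
import pyarrow.parquet as pq

table = pq.read_table('/path/to/duplicate_column_file.parquet')

# Find the index of the second occurrence of the duplicated column name
names = [field.name for field in table.schema]
first = names.index('__index_level_0__')
dup_index = names.index('__index_level_0__', first + 1)

table = table.remove_column(dup_index)
table.to_pandas()  # OK
pq.write_table(table, '/some/output.parquet')  # OK
{code}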
For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py`
here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)