Alexey Strokach created ARROW-1974:
--------------------------------------

             Summary: PyArrow segfaults when working with Arrow tables with 
duplicate columns
                 Key: ARROW-1974
                 URL: https://issues.apache.org/jira/browse/ARROW-1974
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.8.0
         Environment: Linux Mint 18.2
Anaconda Python distribution + pyarrow installed from the conda-forge channel
            Reporter: Alexey Strokach
            Priority: Minor


I accidentally created a large number of Parquet files with two 
__index_level_0__ columns (through a Spark SQL query).

PyArrow can read these files into tables, but it segfaults when converting the 
resulting tables to Pandas DataFrames or when saving the tables to Parquet 
files.

{code:python}
import pyarrow.parquet as pq

# Duplicate columns cause segmentation faults
table = pq.read_table('/path/to/duplicate_column_file.parquet')
table.to_pandas()  # Segmentation fault
pq.write_table(table, '/some/output.parquet')  # Segmentation fault
{code}

If I remove the duplicate column using table.remove_column(...), everything 
works without segfaults.

{code:python}
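import pyarrow.parquet as pq

# After removing duplicate columns, everything works fine
table = pq.read_table('/path/to/duplicate_column_file.parquet')
# remove_column returns a new table, so the result has to be reassigned
table = table.remove_column(34)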
table.to_pandas()  # OK
pq.write_table(table, '/some/output.parquet')  # OK
{code}
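
Rather than hard-coding the column index, the duplicates could probably be 
located by name. The following is only a rough sketch of that workaround, not 
a fix for the underlying bug: `duplicate_column_indices` and 
`drop_duplicate_columns` are hypothetical helpers, and the second one assumes 
the table exposes `schema.names` and a `remove_column(i)` method that returns 
a new table.

```python
def duplicate_column_indices(names):
    """Return the index of every repeated name, keeping the first occurrence."""
    seen = set()
    duplicates = []
    for index, name in enumerate(names):
        if name in seen:
            duplicates.append(index)
        else:
            seen.add(name)
    return duplicates


def drop_duplicate_columns(table):
    """Return a table without repeated column names (hypothetical helper).

    Assumes `table.schema.names` lists the column names in order and that
    `table.remove_column(i)` returns a new table instead of mutating in place.
    """
    # Remove from the highest index down so the remaining indices stay valid.
    for index in reversed(duplicate_column_indices(table.schema.names)):
        table = table.remove_column(index)
    return table
```

With this in place, calling `table = drop_duplicate_columns(table)` right 
after `pq.read_table(...)` should leave only the first `__index_level_0__` 
column before `to_pandas()` or `pq.write_table(...)` is called.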

For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)