Alexey Strokach created ARROW-1974:
--------------------------------------

             Summary: PyArrow segfaults when working with Arrow tables with duplicate columns
                 Key: ARROW-1974
                 URL: https://issues.apache.org/jira/browse/ARROW-1974
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.8.0
         Environment: Linux Mint 18.2
                      Anaconda Python distribution + pyarrow installed from the conda-forge channel
            Reporter: Alexey Strokach
            Priority: Minor
I accidentally created a large number of Parquet files with two __index_level_0__ columns (through a Spark SQL query). PyArrow can read these files into tables, but it segfaults when converting the resulting tables to Pandas DataFrames or when saving the tables to Parquet files.

{code:python}
# Duplicate columns cause segmentation faults
table = pq.read_table('/path/to/duplicate_column_file.parquet')
table.to_pandas()  # Segmentation fault
pq.write_table(table, '/some/output.parquet')  # Segmentation fault
{code}

If I remove the duplicate column using table.remove_column(...), everything works without segfaults.

{code:python}
# After removing the duplicate column, everything works fine
table = pq.read_table('/path/to/duplicate_column_file.parquet')
table = table.remove_column(34)  # remove_column returns a new Table
table.to_pandas()  # OK
pq.write_table(table, '/some/output.parquet')  # OK
{code}

For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors
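Until this is fixed, a more general workaround is to drop duplicate columns by name rather than by a hard-coded index like 34. The sketch below is illustrative rather than taken from the test files above; it assumes a pyarrow version where Table.column_names exists and Table.remove_column returns a new table, and the file paths are placeholders.

{code:python}
# Sketch (not from the original report): drop every repeated occurrence of a
# column name, keeping only the first. Assumes Table.column_names exists and
# Table.remove_column returns a new table; paths are placeholders.
import pyarrow.parquet as pq

table = pq.read_table('/path/to/duplicate_column_file.parquet')

seen = set()
duplicate_indices = []
for i, name in enumerate(table.column_names):
    if name in seen:
        duplicate_indices.append(i)
    seen.add(name)

# Remove from the back so earlier indices remain valid.
for i in reversed(duplicate_indices):
    table = table.remove_column(i)

table.to_pandas()                              # OK once duplicates are gone
pq.write_table(table, '/some/output.parquet')  # OK
{code}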