Justin Tan created ARROW-2369:
---------------------------------

             Summary: Large (>~20 GB) files written to Parquet via PyArrow are 
corrupted
                 Key: ARROW-2369
                 URL: https://issues.apache.org/jira/browse/ARROW-2369
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0
         Environment: Reproduced on Ubuntu + macOS
            Reporter: Justin Tan
             Fix For: 0.9.0
         Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png

When writing large Parquet files (above roughly 20 GB) from Pandas via the 
command

{{pq.write_table(my_df, 'table.parquet')}}

The write succeeds, but when the Parquet file is loaded, the error message

{{ArrowIOError: Invalid parquet file. Corrupt footer.}}

appears. The same error occurs when the Parquet file is written chunkwise. 
When the Parquet files are small, say under 10 GB or so (drawn randomly from 
the same dataset), everything proceeds as normal.

Details:

Arrow v0.9.0

Reproduced on Ubuntu and macOS



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)