[ https://issues.apache.org/jira/browse/ARROW-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17661022#comment-17661022 ]
Rok Mihevc commented on ARROW-3999:
-----------------------------------

This issue has been migrated to [issue #20601|https://github.com/apache/arrow/issues/20601] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [Python] Can't read large file that pyarrow wrote
> -------------------------------------------------
>
>                 Key: ARROW-3999
>                 URL: https://issues.apache.org/jira/browse/ARROW-3999
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.11.1
>         Environment: OS: OSX High Sierra 10.13.6
>                      Python: 3.7.0
>                      PyArrow: 0.11.1
>                      Pandas: 0.23.4
>            Reporter: Diego Argueta
>            Priority: Major
>
> I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a Parquet file using the DataFrame's {{to_parquet}} method. However, reading that same file back results in an exception. The DataFrame consists of about 32 million rows with seven columns; four are ASCII text and three are booleans.
>
> {code:python}
> >>> source_df.shape
> (32070402, 7)
> >>> source_df.dtypes
> Url Source            object
> Url Destination       object
> Anchor text           object
> Follow / No-Follow    object
> Link No-Follow          bool
> Meta No-Follow          bool
> Robot No-Follow         bool
> dtype: object
> >>> source_df.to_parquet('export.parq', compression='gzip', use_deprecated_int96_timestamps=True)
> >>> loaded_df = pd.read_parquet('export.parq')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
>     **kwargs).to_pandas()
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
>     table = reader.read(**options)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
> {code}
>
> One would expect that if PyArrow can write a file successfully, it can read it back as well. Fortunately the {{fastparquet}} library has no problem reading this file, so we didn't lose any data, but the roundtripping problem was a bit of a surprise.
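
The error arises because Arrow's BinaryArray addresses its value buffer with 32-bit offsets, so a single array can hold at most 2147483646 bytes of string data; here the largest column needs 2147483685 bytes, just over the limit. Until the read path handles this, a possible workaround is to rewrite the file with bounded row groups and read it back one row group at a time, so no single BinaryArray ever has to hold an entire column. This is a hedged sketch, not an official fix: it assumes the installed pandas forwards {{row_group_size}} through to {{pyarrow.parquet.write_table}} and that the installed pyarrow honors it, and the group size of 1000000 rows is an arbitrary choice to tune to your row width.

{code:python}
import pandas as pd
import pyarrow.parquet as pq

# Rewrite the data with bounded row groups so that no single string
# column chunk approaches the ~2 GiB BinaryArray limit.
# (Assumption: row_group_size is forwarded to pyarrow's write_table.)
source_df.to_parquet('export.parq', compression='gzip',
                     engine='pyarrow', row_group_size=1000000)

# Read the file back one row group at a time; each group materializes
# as its own small set of arrays, sidestepping the capacity check.
pf = pq.ParquetFile('export.parq')
loaded_df = pd.concat(
    (pf.read_row_group(i).to_pandas() for i in range(pf.num_row_groups)),
    ignore_index=True,
)
{code}

Reading the whole file in one call may still fail on affected versions, since {{ParquetReader.read_all}} (where the traceback ends) appears to concatenate all row groups into single arrays; reading per row group keeps each array under the limit.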