[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278976#comment-17278976 ]
Pac A. He edited comment on ARROW-11456 at 2/4/21, 5:09 PM: ------------------------------------------------------------ Unfortunately I have not been able to produce a reproducible result in a simple example despite multiple attempts and experiments. I read a dataframe with 10 string columns and 2 billion rows without issue. The issue reproduces only over my actual data. Nevertheless, obviously the exception traceback and the error message could still be indicative of what's causing it. Having 31-bit limit makes no sense to me. was (Author: apacman): Unfortunately I have not been able to produce a reproducible result in a simple example despite multiple attempts and experiments. I tried reading a dataframe with 10 string columns and 2 billion rows. The issue reproduces only over my actual data. Nevertheless, obviously the exception traceback and the error message could still be indicative of what's causing it. Having 31-bit limit makes no sense to me. > [Python] Parquet reader cannot read large strings > ------------------------------------------------- > > Key: ARROW-11456 > URL: https://issues.apache.org/jira/browse/ARROW-11456 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Affects Versions: 2.0.0, 3.0.0 > Environment: pyarrow 3.0.0 / 2.0.0 > pandas 1.2.1 > python 3.8.6 > Reporter: Pac A. He > Priority: Major > > When reading a large parquet file, I have this error: > > {noformat} > df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 459, in read_parquet > return impl.read( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 221, in read > return self.api.parquet.read_table( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 1638, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 327, in read > return self.reader.read_all(column_indices=column_indices, > File "pyarrow/_parquet.pyx", line 1126, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Capacity error: BinaryBuilder cannot reserve space for more than > 2147483646 child elements, got 2147483648 > {noformat} > Isn't pyarrow supposed to support large parquets? It let me write this > parquet file, but now it doesn't let me read it back. I don't understand why > arrow uses [31-bit > computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] > It's not even 32-bit as sizes are non-negative. > This problem started after I added a string column with 1.5 billion unique > rows. Each value was effectively a unique base64 encoded length 24 string. -- This message was sent by Atlassian Jira (v8.3.4#803005)