[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pac A. He updated ARROW-11456: ------------------------------ Description: When reading a large parquet file, I have this error: {noformat} df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet return impl.read( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read return self.api.parquet.read_table( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read return self.reader.read_all(column_indices=column_indices, File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648 {noformat} Isn't pyarrow supposed to support large parquets? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why arrow uses [31-bit computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] It's not even 32-bit as sizes are non-negative. This problem started after I added a string column with 2.5 billion unique rows. Each value was effectively a unique base64 encoded length 24 string. Below is code to reproduce the issue: {code:python} from base64 import urlsafe_b64encode import boto3 import numpy as np import pandas as pd import pyarrow as pa import smart_open def num_to_b64(num: int) -> str: return urlsafe_b64encode(num.to_bytes(16, "little")).decode() df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("strcol") with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file: df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False) {code} The above code leads to the error: {noformat} pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2500000000 {noformat} was: When reading a large parquet file, I have this error: {noformat} df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet return impl.read( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read return self.api.parquet.read_table( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read return self.reader.read_all(column_indices=column_indices, File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648 {noformat} Isn't pyarrow supposed to support large parquets? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why arrow uses [31-bit computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] It's not even 32-bit as sizes are non-negative. This problem started after I added a string column with 2.5 billion unique rows. Each value was effectively a unique base64 encoded length 24 string. Below is code to reproduce the issue: {code:python} from base64 import urlsafe_b64encode import boto3 import numpy as np import pandas as pd import pyarrow as pa import smart_open def num_to_b64(num: int) -> str: return urlsafe_b64encode(num.to_bytes(16, "little")).decode() df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("string1") with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file: df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False) {code} The above code leads to the error: {noformat} pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2500000000 {noformat} > [Python] Parquet reader cannot read large strings > ------------------------------------------------- > > Key: ARROW-11456 > URL: https://issues.apache.org/jira/browse/ARROW-11456 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Affects Versions: 2.0.0, 3.0.0 > Environment: pyarrow 3.0.0 / 2.0.0 > pandas 1.2.1 > python 3.8.6 > Reporter: Pac A. He > Priority: Major > > When reading a large parquet file, I have this error: > {noformat} > df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 459, in read_parquet > return impl.read( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 221, in read > return self.api.parquet.read_table( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 1638, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 327, in read > return self.reader.read_all(column_indices=column_indices, > File "pyarrow/_parquet.pyx", line 1126, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Capacity error: BinaryBuilder cannot reserve space for more than > 2147483646 child elements, got 2147483648 > {noformat} > Isn't pyarrow supposed to support large parquets? It let me write this > parquet file, but now it doesn't let me read it back. I don't understand why > arrow uses [31-bit > computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] > It's not even 32-bit as sizes are non-negative. > This problem started after I added a string column with 2.5 billion unique > rows. Each value was effectively a unique base64 encoded length 24 string. > Below is code to reproduce the issue: > {code:python} > from base64 import urlsafe_b64encode > import boto3 > import numpy as np > import pandas as pd > import pyarrow as pa > import smart_open > def num_to_b64(num: int) -> str: > return urlsafe_b64encode(num.to_bytes(16, "little")).decode() > df = > pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("strcol") > with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file: > df.to_parquet(output_file, engine="pyarrow", compression="gzip", > index=False) > {code} > The above code leads to the error: > {noformat} > pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more > than 2147483646 child elements, got 2500000000 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)