[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279136#comment-17279136
 ] 

Weston Pace edited comment on ARROW-11456 at 2/4/21, 8:26 PM:
--------------------------------------------------------------

The 31 bit limit you are referencing is not the 31 bit limit that is at play 
here and not really relevant.  There is another 31 bit limit that has to do 
with how arrow stores strings.  Parquet does not need to support random access 
of strings.  The way it stores byte arrays & byte array lengths does not 
support random access.  You could not fetch the ith string of a parquet encoded 
utf8 byte array.

Arrow does need to support this use case.  It stores strings using two arrays.  
The first is an array of offsets.  The second is an array of bytes.  To fetch 
the ith string Arrow will look up offsets[i] and offsets[i+1] to determine the 
range that needs to be fetched from the array of bytes.

There are two string types in Arrow, "string" and "large_string".  The "string" 
data type uses 4 byte signed integer offsets while the "large_string" data type 
uses 8 byte signed integer offsets.  So it is not possible to create a "string" 
array with data containing more than 2 billion bytes.

Now, this is not normally a problem.  Arrow can fall back to a chunked array 
(which is why the 31 bit limit you reference isn't always such an issue).
{code:java}
>>> import pyarrow as pa
>>> x = '0' * 1024
>>> y = [x] * (1024 * 1024 * 2)
>>> len(y)
2097152 // # of strings
>>> len(y) * 1024
2147483648 // # of bytes
>>> a = pa.array(y)
>>> len(a.chunks)
2
>>> len(a.chunks[0])
2097151
>>> len(a.chunks[1])
1
{code}
However, it does seem that, if there are 2 billion strings (as opposed to just 
2 billion bytes), the chunked array fallback is not applying.
{code:java}
>>> x = '0' * 8
>>> y = [x] * (1024 * 1024 * 1024 * 2)
>>> len(y)
2147483648
>>> len(y) * 8
17179869184
>>> a = pa.array(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\array.pxi", line 296, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow\error.pxi", line 122, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 109, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 2147483648
{code}
This "should" be representable using a chunked array with two chunks.  It is 
possible this is the source of your issue.  Or maybe when reading from parquet 
the "fallback to chunked array" logic simply doesn't apply.  I don't know the 
parquet code well enough.  That is one of the reasons it would be helpful to 
have a reproducible test.

It also might be easier to just write your parquet out to multiple files or 
multiple row groups.  Both of these approaches should not only avoid this issue 
but also reduce the memory pressure when you are converting to pandas.


was (Author: westonpace):
The 31 bit limit you are referencing is not the 31 bit limit that is at play 
here and not really relevant.  There is another 31 bit limit that has to do 
with how arrow stores strings.  Parquet does not need to support random access 
of strings.  The way it stores byte arrays & byte array lengths does not 
support random access.  You could not fetch the ith string of a parquet encoded 
utf8 byte array.

Arrow does need to support this use case.  It stores strings using two arrays.  
The first is an array of offsets.  The second is an array of bytes.  To fetch 
the ith string Arrow will look up offsets[i] and offsets[i+1] to determine the 
range that needs to be fetched from the array of bytes.

There are two string types in Arrow, "string" and "large_string".  The "string" 
data type uses 4 byte signed integer offsets while the "large_string" data type 
uses 8 byte signed integer offsets.  So it is not possible to create a "string" 
array with data containing more than 2 billion bytes.

Now, this is not normally a problem.  Arrow can fall back to a chunked array 
(which is why the 31 bit limit you reference isn't such an issue).
{code:java}
>>> import pyarrow as pa
>>> x = '0' * 1024
>>> y = [x] * (1024 * 1024 * 2)
>>> len(y)
2097152 // # of strings
>>> len(y) * 1024
2147483648 // # of bytes
>>> a = pa.array(y)
>>> len(a.chunks)
2
>>> len(a.chunks[0])
2097151
>>> len(a.chunks[1])
1
{code}
However, it does seem that, if there are 2 billion strings (as opposed to just 
2 billion bytes), the chunked array fallback is not applying.
{code:java}
>>> x = '0' * 8
>>> y = [x] * (1024 * 1024 * 1024 * 2)
>>> len(y)
2147483648
>>> len(y) * 8
17179869184
>>> a = pa.array(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\array.pxi", line 296, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow\error.pxi", line 122, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 109, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 2147483648
{code}
This "should" be representable using a chunked array with two chunks.  It is 
possible this is the source of your issue.  Or maybe when reading from parquet 
the "fallback to chunked array" logic simply doesn't apply.  I don't know the 
parquet code well enough.  That is one of the reasons it would be helpful to 
have a reproducible test.

It also might be easier to just write your parquet out to multiple files or 
multiple row groups.  Both of these approaches should not only avoid this issue 
but also reduce the memory pressure when you are converting to pandas.

> [Python] Parquet reader cannot read large strings
> -------------------------------------------------
>
>                 Key: ARROW-11456
>                 URL: https://issues.apache.org/jira/browse/ARROW-11456
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0, 3.0.0
>         Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>            Reporter: Pac A. He
>            Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
>     df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
>     return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
>     return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
>     return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 1.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 24 string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to