Wes McKinney created ARROW-4099:
-----------------------------------

             Summary: [Python] Pretty printing very large ChunkedArray objects can use unbounded memory
                 Key: ARROW-4099
                 URL: https://issues.apache.org/jira/browse/ARROW-4099
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Wes McKinney
             Fix For: 0.13.0


In working on ARROW-2970, I have the following dataset:

{code}
import numpy as np
import pyarrow as pa

# One 1-byte value followed by 2048 values of 1MB each (~2GB of binary data)
values = [b'x'] + [
    b'x' * (1 << 20)
] * 2 * (1 << 10)

arr = np.array(values)

arrow_arr = pa.array(arr)
{code}

The object {{arrow_arr}} is a ChunkedArray with 129 chunks, each element of which is 1MB of binary data. The repr of this object is over 600MB:

{code}
In [10]: rep = repr(arrow_arr)

In [11]: len(rep)
Out[11]: 637536258
{code}
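
As a side note, the chunk layout can be inspected directly; {{num_chunks}} and {{chunk}} are standard {{ChunkedArray}} accessors:

{code}
# Confirm the chunk layout described above
arrow_arr.num_chunks        # -> 129 in this example
len(arrow_arr.chunk(0))     # number of elements in the first chunk
{code}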

There are probably a number of failsafes we can implement to avoid badness in these pathological cases. They may not happen often, but given the kinds of bug reports we are seeing, people do have datasets that look like this.
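
One possible failsafe is a bounded preview that caps the number of chunks, elements per chunk, and bytes shown per value. A minimal sketch follows; {{bounded_repr}} and its parameters are hypothetical, not an existing pyarrow API, and it assumes a binary-type ChunkedArray like the one above:

{code}
def bounded_repr(chunked_arr, max_chunks=3, max_elements=5, max_value_bytes=32):
    # Hypothetical failsafe: build a preview whose size is bounded no matter
    # how large the ChunkedArray is. Assumes binary values, as in the example.
    lines = ['ChunkedArray with %d chunks (showing at most %d)'
             % (chunked_arr.num_chunks, max_chunks)]
    for i in range(min(chunked_arr.num_chunks, max_chunks)):
        chunk = chunked_arr.chunk(i)
        lines.append('  chunk %d (%d elements):' % (i, len(chunk)))
        # Slice (zero-copy) before iterating so we only touch a few values
        for value in chunk[:max_elements]:
            raw = value.as_py()
            shown = repr(raw[:max_value_bytes])
            if len(raw) > max_value_bytes:
                shown += ' ... (%d bytes total)' % len(raw)
            lines.append('    ' + shown)
        if len(chunk) > max_elements:
            lines.append('    ... %d more elements' % (len(chunk) - max_elements))
    if chunked_arr.num_chunks > max_chunks:
        lines.append('  ... %d more chunks' % (chunked_arr.num_chunks - max_chunks))
    return '\n'.join(lines)
{code}

Applied to the ~2GB array above, something like this would yield a preview of a few hundred bytes rather than a 600MB string.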


