Wes McKinney created ARROW-4099:
-----------------------------------

             Summary: [Python] Pretty printing very large ChunkedArray objects can use unbounded memory
                 Key: ARROW-4099
                 URL: https://issues.apache.org/jira/browse/ARROW-4099
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Wes McKinney
             Fix For: 0.13.0
In working on ARROW-2970, I have the following dataset:

{code}
import numpy as np
import pyarrow as pa

values = [b'x'] + [
    b'x' * (1 << 20)
] * 2 * (1 << 10)

arr = np.array(values)
arrow_arr = pa.array(arr)
{code}

The object {{arrow_arr}} has 129 chunks, each element of which is 1MB of binary. The repr for this object is over 600MB:

{code}
In [10]: rep = repr(arrow_arr)

In [11]: len(rep)
Out[11]: 637536258
{code}

There are probably a number of failsafes we can implement to avoid badness in these pathological cases. They may not happen often, but given the kinds of bug reports we are seeing, people do have datasets that look like this.
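As a rough sketch of one possible failsafe (not the actual fix for this issue), the repr could cap the number of values it formats and truncate long binary values. The helper name {{bounded_repr}} and its parameters below are illustrative only and not part of the pyarrow API:

{code}
import pyarrow as pa

def bounded_repr(chunked_arr, max_elements=10, max_value_len=32):
    # Illustrative only: format at most `max_elements` values across all
    # chunks, truncating each formatted value to `max_value_len` characters,
    # so the output size stays bounded regardless of the data size.
    lines = []
    remaining = max_elements
    for chunk in chunked_arr.chunks:
        for i in range(len(chunk)):
            if remaining == 0:
                lines.append('  ... ({} values total)'.format(len(chunked_arr)))
                return '\n'.join(lines)
            text = repr(chunk[i].as_py())
            if len(text) > max_value_len:
                text = text[:max_value_len] + '...'
            lines.append('  ' + text)
            remaining -= 1
    return '\n'.join(lines)

# Small example: two chunks of 1MB binary values, printed in bounded form
chunked = pa.chunked_array([[b'x' * (1 << 20)] * 4, [b'y' * (1 << 20)] * 4])
print(bounded_repr(chunked, max_elements=3))
{code}

The key point is that the cost of the formatted output is proportional to the element and length limits rather than to the total size of the ChunkedArray.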