[ https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651169#comment-17651169 ]
Joris Van den Bossche commented on ARROW-18400:
-----------------------------------------------

A small reproducible example to illustrate my explanation above:

{code:python}
import pyarrow as pa

# create a chunked list array that consists of two chunks that are both slices into the same parent array
arr = pa.array([[1, 2], [3, 4, 5], [6], [7, 8]])
chunked_arr = pa.chunked_array([arr.slice(0, 2), arr.slice(2, 2)])

# convert this chunked array to numpy
np_arr = chunked_arr.to_numpy()

# the list array gets converted to a numpy array of numpy arrays; each element (the nested numpy array) is
# a slice of a numpy array of the flat values. We can get this parent flat numpy array through the .base property
>>> np_arr[0].base
array([[1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8]])

# the flat values are included twice. Compare with the correct behaviour for the original non-chunked array:
>>> arr.to_numpy(zero_copy_only=False)[0].base
array([[1, 2, 3, 4, 5, 6, 7, 8]])
{code}
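To see how this duplication scales, here is a small sketch (not from the ticket) that generalises the example above to a varying number of chunks sliced from the same parent array. The helper name {{base_size_for_chunks}} is hypothetical, and the printed sizes depend on the pyarrow version: on affected versions (e.g. 10.0.1) the base buffer should grow roughly quadratically with the number of chunks, while a fixed version should grow only linearly.

{code:python}
import pyarrow as pa

def base_size_for_chunks(num_chunks, rows_per_chunk=2):
    # build one parent list array and slice it into `num_chunks` chunks,
    # mimicking the reproducer above but with a configurable chunk count
    parent = pa.array([[i, i + 1] for i in range(num_chunks * rows_per_chunk)])
    chunks = [parent.slice(i * rows_per_chunk, rows_per_chunk)
              for i in range(num_chunks)]
    np_arr = pa.chunked_array(chunks).to_numpy()
    # size of the flat-values buffer backing the first converted element
    base = np_arr[0].base
    return base.nbytes if base is not None else np_arr[0].nbytes

for k in (2, 4, 8, 16):
    print(k, base_size_for_chunks(k))
# with the duplication described above, each of the k chunks drags in the
# parent's full flat values, so the base size grows ~quadratically with k
{code}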
> [Python] Quadratic memory usage of Table.to_pandas with nested data
> -------------------------------------------------------------------
>
>                  Key: ARROW-18400
>                  URL: https://issues.apache.org/jira/browse/ARROW-18400
>              Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 10.0.1
>         Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900X
>                      with 64 GB RAM
>            Reporter: Adam Reeve
>            Assignee: Alenka Frim
>            Priority: Critical
>             Fix For: 11.0.0
>
>         Attachments: test_memory.py
>
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame
> shows quadratic memory usage and will eventually run out of memory for
> reasonably small files. I had initially thought this was a regression since
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
>
> _characters = string.ascii_uppercase + string.digits + string.punctuation
>
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
>
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
>
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>         [{
>             'a': None if i % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
>             'b': None if i % 100 == 0 else random.choice(range(100)),
>             'c': None if i % 10 == 0 else make_random_string(5)
>         } for i in range(arr_len)]
>     ))
>
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
>
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem; it's the to_pandas method
> that exhibits the large memory usage. I haven't tested generating nested
> Arrow data in memory without writing Parquet from Pandas, but I assume the
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when the row count
> doubles above 256k rows. With Arrow 7.0.0, memory usage is more linear but
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1, so it looks like something
> changed between 7.0.0 and 8.0.0.
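As a rough way to reproduce numbers like those in the table above, memory use around {{to_pandas}} can be observed with a short script along these lines. This is only a sketch, not the attached test_memory.py; it assumes the psutil package and the nested.parquet file produced by the generator above, and it reports current RSS rather than peak usage, so transient allocations are understated.

{code:python}
import psutil
import pyarrow.parquet as pq

process = psutil.Process()

def rss_mb():
    # resident set size of the current process, in megabytes
    return process.memory_info().rss / 1e6

table = pq.read_table('nested.parquet')
print(f"after read_table: {rss_mb():.0f} MB")

df = table.to_pandas()
print(f"after to_pandas:  {rss_mb():.0f} MB")
{code}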