Jorge created ARROW-5302:
----------------------------

             Summary: Memory leak when read_table().to_pandas().to_json(orient='records')
                 Key: ARROW-5302
                 URL: https://issues.apache.org/jira/browse/ARROW-5302
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.13.0
         Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
            Reporter: Jorge
The following piece of code (running on Linux, Python 3.6 from Anaconda) demonstrates a memory leak when repeatedly reading data from disk and serializing it to JSON.

{code:python}
import resource

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Some random data, some of it stored as array columns.
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
    'a': ['AA%d' % i for i in range(batches)],
    't': [list(range(0, 180 * 60, 5))] * batches,
    'v': list(np.random.normal(10, 0.1, size=(batches, 180 * 60 // 5))),
    'u': ['t'] * batches,
})
pq.write_table(pa.Table.from_pandas(df), path)

# Read the data above and convert it to JSON (e.g. the backend of a
# RESTful API). Commenting out either of the two lines in the loop body
# makes the leak vanish.
for i in range(100):
    df = pq.read_table(path).to_pandas()
    df.to_json(orient='records')
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
{code}

Result:

{code}
785560
1065460
1383532
1607676
1924820
...
{code}

Relevant pip freeze:

pyarrow (0.13.0)
pandas (0.24.2)
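As a possible aid in triaging, here is a minimal diagnostic sketch (not part of the original repro; it assumes the data.parquet file written above and pyarrow's total_allocated_bytes() API) that prints Arrow's own allocation counter next to the process high-water mark on each iteration:

{code:python}
import resource

import pyarrow as pa
import pyarrow.parquet as pq

path = 'data.parquet'  # the file written by the repro above

for i in range(100):
    df = pq.read_table(path).to_pandas()
    df.to_json(orient='records')
    # Bytes currently held by Arrow's default memory pool vs. the
    # process peak RSS reported by the OS.
    print(pa.total_allocated_bytes(),
          resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
{code}

If total_allocated_bytes() stays flat while ru_maxrss keeps climbing, the retained memory is being held outside Arrow's default memory pool (e.g. by pandas or the JSON serializer) rather than by Arrow itself.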