[ https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rok Mihevc updated ARROW-5302:
------------------------------
    External issue URL: https://github.com/apache/arrow/issues/21766

> Memory leak when read_table().to_pandas().to_json()
> ---------------------------------------------------
>
>                  Key: ARROW-5302
>                  URL: https://issues.apache.org/jira/browse/ARROW-5302
>              Project: Apache Arrow
>           Issue Type: Bug
>           Components: Python
>     Affects Versions: 0.13.0
>          Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
>             Reporter: Jorge Leitão
>             Priority: Major
>               Labels: memory-leak
>
> The following piece of code (run on Linux, Python 3.6 from Anaconda) demonstrates a memory leak when reading data from disk.
> {code:python}
> import resource
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # some random data, stored as an array column
> path = 'data.parquet'
> batches = 5000
> df = pd.DataFrame({
>     't': [list(range(0, 180 * 60, 5))] * batches,
> })
> pq.write_table(pa.Table.from_pandas(df), path)
>
> table = pq.read_table(path)
>
> # read the data above and convert it to JSON (e.g. the backend of a RESTful API)
> for i in range(100):
>     # comment out either of the next two lines and the leak vanishes
>     df = pq.read_table(path).to_pandas()
>     df['t'].to_json()
>     print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
> {code}
> Result:
> {code}
> 481676
> 618584
> 755396
> 892156
> 1028892
> 1165660
> 1302428
> 1439184
> 1620376
> 1801340
> ...{code}
> Relevant pip freeze:
> pyarrow (0.13.0)
> pandas (0.24.2)
>
> Note: it is not entirely obvious that the leak is caused by pyarrow rather than pandas or numpy; I was only able to reproduce it by writing and reading through pyarrow.
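>
> One way to narrow down which allocator is holding the memory (a minimal sketch, not part of the original report): pyarrow exposes the byte count of its default memory pool via pa.total_allocated_bytes(), so the repro loop can print it next to ru_maxrss. If the RSS high-water mark keeps growing while Arrow's pool stays flat, the leaked memory would appear to be held by Python/numpy/pandas objects rather than by Arrow buffers; if the pool grows too, that would point back at Arrow itself.
> {code:python}
> import resource
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> path = 'data.parquet'  # the file written by the repro above
>
> for i in range(100):
>     df = pq.read_table(path).to_pandas()
>     df['t'].to_json()
>     # ru_maxrss: process memory high-water mark (KB on Linux);
>     # total_allocated_bytes(): bytes currently held by Arrow's default memory pool
>     print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
>           pa.total_allocated_bytes())
> {code}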