Jorge created ARROW-5302:
----------------------------

             Summary: Memory leak when read_table().to_pandas().to_json(orient='records')
                 Key: ARROW-5302
                 URL: https://issues.apache.org/jira/browse/ARROW-5302
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.13.0
         Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
            Reporter: Jorge


The following piece of code (running on Linux, Python 3.6 from Anaconda)
demonstrates a memory leak when repeatedly reading data from disk and converting it to JSON.
{code:python}
import resource

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# some random data, including array columns
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
    'a': ['AA%d' % i for i in range(batches)],
    't': [list(range(0, 180 * 60, 5))] * batches,
    'v': list(np.random.normal(10, 0.1, size=(batches, 180 * 60 // 5))),
    'u': ['t'] * batches,
})

pq.write_table(pa.Table.from_pandas(df), path)

# read the data above and convert it to json (e.g. the backend of a restful API)
for i in range(100):
    # comment out either of the next two lines and the leak vanishes
    df = pq.read_table(path).to_pandas()
    df.to_json(orient='records')
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

{code}
Result:
{code}
785560
1065460
1383532
1607676
1924820
...{code}
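For reference, a minimal diagnostic sketch to help narrow down where the growth comes from (assuming pyarrow 0.13 exposes pyarrow.total_allocated_bytes() for the default memory pool; gc.collect() is forced each iteration to rule out garbage that is merely pending collection):
{code:python}
import gc
import resource

import pyarrow as pa
import pyarrow.parquet as pq

path = 'data.parquet'  # the file written by the script above

for i in range(100):
    df = pq.read_table(path).to_pandas()
    df.to_json(orient='records')
    gc.collect()  # discard any cyclic garbage before measuring
    # ru_maxrss is the process-wide high-water mark;
    # total_allocated_bytes() reports live allocations held by
    # Arrow's default memory pool
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
          pa.total_allocated_bytes())
{code}
If ru_maxrss keeps climbing while total_allocated_bytes() returns to a stable baseline each iteration, the retained memory is presumably held outside Arrow's pool (e.g. on the Python heap or in the allocator).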
Relevant pip freeze:
{code}
pyarrow (0.13.0)
pandas (0.24.2)
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
