Good evening Abdul and Wes,

@Abdul, I agree that I could probably use Plasma, but I just wanted to get something up and running quickly for prototyping purposes, and as @Wes mentioned, I will probably run into the same thing using Plasma. I managed to get a little more debugging output. Here is the script that I am running:
(karrow) ➜  test git:(master) ✗ cat test.py
import pyarrow as pa
import pandas as pd
import numpy as np
import time
import os
import psutil

process = psutil.Process(os.getpid())
print("Memory before memory_map: "+str(process.memory_info().rss))
mm = pa.memory_map('/dev/shm/arrow_table', 'r')
print("Memory after memory_map: "+str(process.memory_info().rss))

print("Memory before reading_buffer: "+str(process.memory_info().rss))
b = mm.read_buffer()
print("Memory after reading_buffer: "+str(process.memory_info().rss))
print("buffer size: "+str(b.size))

print("Memory before RecordBatchStreamReader: "+str(process.memory_info().rss))
reader = pa.RecordBatchStreamReader(b)
print("Memory after RecordBatchStreamReader: "+str(process.memory_info().rss))
print("Memory before read_all: "+str(process.memory_info().rss))
z = reader.read_all()
print("Memory after read_all: "+str(process.memory_info().rss))
startm = process.memory_info().rss
print("Memory before to_pandas: "+str(startm))
start = time.time()
df = z.to_pandas(zero_copy_only=True)
dt = time.time() - start
df_size = df.memory_usage().sum()
endm = process.memory_info().rss
print("Memory after to_pandas: "+str(endm))
print("Difference in memory usage after call to to_pandas: "+str(endm-startm))
print("Converted %d byte arrow table to %d byte dataframe at a rate of %f bytes/sec" % (b.size, df_size, (b.size/dt)))

Here is the output:

(karrow) ➜  test git:(master) ✗ python3 -i ./test.py
Memory before memory_map: 72962048
Memory after memory_map: 72962048
Memory before reading_buffer: 72962048
Memory after reading_buffer: 72962048
buffer size: 2000000476
Memory before RecordBatchStreamReader: 72962048
Memory after RecordBatchStreamReader: 72962048
Memory before read_all: 72962048
Memory after read_all: 72962048
Memory before to_pandas: 72962048
Memory after to_pandas: 4074319872
Difference in memory usage after call to to_pandas: 4001357824
Converted 2000000476 byte arrow table to 2000000080 byte dataframe at a rate of 2615333819.434341 bytes/sec

The Apache Arrow table is composed of three columns:

>>> z
pyarrow.Table
int: int32
float: double
bigint: int64

The output dataframe has these types, which look reasonable:

>>> df.ftypes
int          int32:dense
float      float64:dense
bigint       int64:dense
dtype: object
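For anyone who wants to poke at this before I post the full test case: the /dev/shm/arrow_table file was written in the RecordBatch stream format (that is why the reader uses RecordBatchStreamReader). A minimal writer along these lines should produce a compatible file; this is a sketch reconstructed from memory rather than my exact script, the row count here is just a placeholder (my real file was about 2 GB), and the column names and types follow the schema shown above:

    import numpy as np
    import pyarrow as pa

    n_rows = 1000000  # placeholder; scale up to taste

    # Build one record batch with the three columns shown above.
    batch = pa.RecordBatch.from_arrays(
        [pa.array(np.arange(n_rows, dtype=np.int32)),
         pa.array(np.random.rand(n_rows)),
         pa.array(np.arange(n_rows, dtype=np.int64))],
        ['int', 'float', 'bigint'])

    # Write it in the streaming format to shared memory so the reader
    # script above can memory-map it.
    with pa.OSFile('/dev/shm/arrow_table', 'wb') as sink:
        writer = pa.RecordBatchStreamWriter(sink, batch.schema)
        writer.write_batch(batch)
        writer.close()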
Over the weekend, I will try to write a self-contained, reproducible test case, but I thought this might start to give some insight into whether there is an issue and, if so, what it may be.

Regards,

Bipin

On Fri, Sep 28, 2018 at 5:41 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Abdul -- Plasma vs. a memory map on /dev/shm should have the same
> semantics re: memory copying, so I don't believe using Plasma will
> change the outcome
>
> - Wes
>
> On Fri, Sep 28, 2018 at 5:38 PM Abdul Rahman <abdulrahman...@outlook.com> wrote:
> >
> > Have you tried using Plasma, which is effectively what you are trying to do?
> >
> > https://arrow.apache.org/docs/python/plasma.html#using-arrow-and-pandas-with-plasma
> >
> > ________________________________
> > From: Bipin Mathew <bipinmat...@gmail.com>
> > Sent: Friday, September 28, 2018 2:28:54 PM
> > To: dev@arrow.apache.org
> > Subject: Help with zero-copy conversion of pyarrow table to pandas dataframe.
> >
> > Hello Everyone,
> >
> > I am just getting my feet wet with Apache Arrow and I am running into
> > a bug or, more likely, simply misunderstanding the pyarrow API. I wrote
> > out a four column, million row Apache Arrow table to shared memory and I
> > am attempting to read it into a pandas dataframe. It is advertised that
> > it is possible to do this in a zero-copy manner; however, when I run the
> > to_pandas() method on the table I imported into pyarrow, my memory usage
> > increases, indicating that it did not actually do a zero-copy conversion.
> > Here is my code:
> >
> > import pyarrow as pa
> > import pandas as pd
> > import numpy as np
> > import time
> >
> > start = time.time()
> > mm = pa.memory_map('/dev/shm/arrow_table')
> > b = mm.read_buffer()
> > reader = pa.RecordBatchStreamReader(b)
> > z = reader.read_all()
> > print("reading time: "+str(time.time()-start))
> >
> > start = time.time()
> > df = z.to_pandas(zero_copy_only=True, use_threads=True)
> > print("conversion time: "+str(time.time()-start))
> >
> > What am I doing wrong here? Or indeed am I simply misunderstanding what
> > is meant by zero-copy in this context? My frantic Google efforts only
> > resulted in this possibly relevant issue, but it was unclear to me how
> > it was resolved:
> >
> > https://github.com/apache/arrow/issues/1649
> >
> > I am using pyarrow 0.10.0.
> >
> > Regards,
> >
> > Bipin
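P.S. To help narrow down whether the extra memory comes from assembling the full dataframe or from converting the individual columns, I also plan to measure a single-column conversion, roughly like this (an untested sketch; I am assuming Column.to_pandas() works this way in 0.10):

    import os
    import psutil

    # z is the Table loaded by the script above
    process = psutil.Process(os.getpid())
    before = process.memory_info().rss
    s = z.column(1).to_pandas()   # the 'float' column, selected by position
    after = process.memory_info().rss
    print("RSS growth for one column: %d bytes" % (after - before))

If a single column also grows RSS by roughly the size of that column, the copy would seem to happen per column rather than only during dataframe assembly.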