Hi Abdul -- Plasma and a memory map on /dev/shm should have the same semantics with respect to memory copying, so I don't believe using Plasma will change the outcome.
- Wes

On Fri, Sep 28, 2018 at 5:38 PM Abdul Rahman <abdulrahman...@outlook.com> wrote:
>
> Have you tried using Plasma, which is effectively what you are trying to do?
>
> https://arrow.apache.org/docs/python/plasma.html#using-arrow-and-pandas-with-plasma
>
>
> ________________________________
> From: Bipin Mathew <bipinmat...@gmail.com>
> Sent: Friday, September 28, 2018 2:28:54 PM
> To: dev@arrow.apache.org
> Subject: Help with zero-copy conversion of pyarrow table to pandas dataframe.
>
> Hello Everyone,
>
> I am just getting my feet wet with Apache Arrow and I am running into
> a bug or, more likely, simply misunderstanding the pyarrow API. I wrote out
> a four-column, million-row Apache Arrow table to shared memory and I am
> attempting to read it into a pandas dataframe. It is advertised that it is
> possible to do this in a zero-copy manner; however, when I run the
> to_pandas() method on the table I imported into pyarrow, my memory usage
> increases, indicating that it did not actually do a zero-copy conversion.
> Here is my code:
>
> import pyarrow as pa
> import pandas as pd
> import numpy as np
> import time
>
> # Read the Arrow stream from the shared-memory file
> start = time.time()
> mm = pa.memory_map('/dev/shm/arrow_table')
> b = mm.read_buffer()
> reader = pa.RecordBatchStreamReader(b)
> z = reader.read_all()
> print("reading time: " + str(time.time() - start))
>
> # Convert the table to pandas, requesting zero-copy only
> start = time.time()
> df = z.to_pandas(zero_copy_only=True, use_threads=True)
> print("conversion time: " + str(time.time() - start))
>
> What am I doing wrong here? Or indeed am I simply misunderstanding what is
> meant by zero-copy in this context? My frantic Google efforts only resulted
> in this possibly relevant issue, but it was unclear to me how it was
> resolved:
>
> https://github.com/apache/arrow/issues/1649
>
> I am using pyarrow 0.10.0.
>
> Regards,
>
> Bipin
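
For anyone following the thread, it may help to make "zero-copy" on a memory map concrete. The sketch below uses only the Python standard library and does not show what pyarrow does internally. It maps a temporary file (standing in for the /dev/shm file above), takes a memoryview of the mapping, which shares the mapped pages, and a bytes() copy, which allocates new memory. The helper name demo_zero_copy_vs_copy and the temporary file are assumptions made purely for illustration.

```python
import mmap
import os
import tempfile


def demo_zero_copy_vs_copy(size=1024):
    """Hypothetical helper: contrast a zero-copy view with a real copy."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\x00" * size)          # file starts as all zero bytes
        with mmap.mmap(fd, size) as mm:
            view = memoryview(mm)             # zero-copy: refers to the mapped pages
            copied = bytes(view)              # copy: allocates `size` new bytes
            mm[0] = 255                       # write through the mapping
            result = (view[0], copied[0])     # the view sees the write, the copy does not
            view.release()                    # release the view before the map closes
        return result
    finally:
        os.close(fd)
        os.remove(path)


print(demo_zero_copy_vs_copy())  # (255, 0)
```

The view reflects the later write because it points at the same memory, while the copy keeps the old value. Process memory growing after a conversion, as described in the original message, is the kind of symptom the copying case would produce.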