Hi Bipin,

There are narrow circumstances where zero-copy pandas deserialization is possible. Firstly, I noted that we are short of documentation for Table.to_pandas, so I opened

https://issues.apache.org/jira/browse/ARROW-3356

It's possible there's a bug when zero_copy_only=True -- it is supposed to raise an exception if any memory allocations are required.

Can you give more information about what you mean by "my memory usage increases"? Did it increase by the footprint of the underlying memory? A minimal reproducible example would help us investigate further.

Thanks,
Wes

On Fri, Sep 28, 2018 at 5:29 PM Bipin Mathew <bipinmat...@gmail.com> wrote:
>
> Hello Everyone,
>
> I am just getting my feet wet with Apache Arrow and I am running into
> a bug or, more likely, simply misunderstanding the pyarrow API. I wrote
> out a four-column, million-row Apache Arrow table to shared memory and I
> am attempting to read it into a pandas dataframe. It is advertised that
> it is possible to do this in a zero-copy manner; however, when I run the
> to_pandas() method on the table I imported into pyarrow, my memory usage
> increases, indicating that it did not actually do a zero-copy conversion.
> Here is my code:
>
>     import pyarrow as pa
>     import pandas as pd
>     import numpy as np
>     import time
>
>     start = time.time()
>     mm = pa.memory_map('/dev/shm/arrow_table')
>     b = mm.read_buffer()
>     reader = pa.RecordBatchStreamReader(b)
>     z = reader.read_all()
>     print("reading time: " + str(time.time() - start))
>
>     start = time.time()
>     df = z.to_pandas(zero_copy_only=True, use_threads=True)
>     print("conversion time: " + str(time.time() - start))
>
> What am I doing wrong here? Or indeed am I simply misunderstanding what
> is meant by zero-copy in this context? My frantic google efforts only
> resulted in this possibly relevant issue, but it was unclear to me how
> it was resolved:
>
> https://github.com/apache/arrow/issues/1649
>
> I am using pyarrow 0.10.0.
>
> Regards,
>
> Bipin