Hello everyone,

I am just getting my feet wet with Apache Arrow, and I have either run into a bug or, more likely, am simply misunderstanding the pyarrow API. I wrote a four-column, million-row Arrow table to shared memory and am attempting to read it back into a pandas DataFrame. This is advertised as a zero-copy operation; however, when I call to_pandas() on the table after loading it into pyarrow, my memory usage increases, which indicates that the conversion is not actually zero-copy.
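For completeness, the writer side looked roughly like this (a minimal sketch from memory; the column names and the random data are placeholders for my actual table):

import numpy as np
import pandas as pd
import pyarrow as pa

# Build a four column, million row table (placeholder data).
n = 1000000
df = pd.DataFrame({
    'a': np.random.randn(n),
    'b': np.random.randn(n),
    'c': np.random.randn(n),
    'd': np.random.randn(n),
})
table = pa.Table.from_pandas(df)

# Write the table to shared memory using the stream format,
# so it can be read back with RecordBatchStreamReader.
sink = pa.OSFile('/dev/shm/arrow_table', 'wb')
writer = pa.RecordBatchStreamWriter(sink, table.schema)
writer.write_table(table)
writer.close()
sink.close()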
Here is the code that reads the table back and converts it:

import pyarrow as pa
import pandas as pd
import numpy as np
import time

start = time.time()
# Memory-map the file in shared memory and read it as an Arrow buffer
mm = pa.memory_map('/dev/shm/arrow_table')
b = mm.read_buffer()
reader = pa.RecordBatchStreamReader(b)
z = reader.read_all()
print("reading time: " + str(time.time() - start))

start = time.time()
# Convert the Arrow table to a pandas DataFrame, requesting zero-copy
df = z.to_pandas(zero_copy_only=True, use_threads=True)
print("conversion time: " + str(time.time() - start))

What am I doing wrong here? Or am I simply misunderstanding what is meant by zero-copy in this context? My frantic Google searching turned up only this possibly relevant issue, but it was unclear to me how it was resolved: https://github.com/apache/arrow/issues/1649

I am using pyarrow 0.10.0.

Regards,
Bipin