Good Evening Abdul, Wes,

     @Abdul, I agree I could probably use Plasma, but I just wanted to get
something up and running quickly for prototyping purposes. As @Wes
mentioned, I would probably run into the same thing using Plasma anyway. I
managed to get a little more debugging output. Here is the script I am
running:

(karrow) ➜  test git:(master) ✗ cat test.py


> import pyarrow as pa
> import pandas as pd
> import numpy as np
> import time
> import os
> import psutil
>
> process = psutil.Process(os.getpid())
> print("Memory before memory_map: "+str(process.memory_info().rss))
> mm=pa.memory_map('/dev/shm/arrow_table','r')
> print("Memory after before memory_map: "+str(process.memory_info().rss))
>
> print("Memory before reading_buffer: "+str(process.memory_info().rss))
> b=mm.read_buffer()
> print("Memory after reading_buffer: "+str(process.memory_info().rss))
> print("buffer size: "+str(b.size))
>
> print("Memory before RecordBatchStreamReader:
> "+str(process.memory_info().rss))
> reader = pa.RecordBatchStreamReader(b)
> print("Memory after RecordBatchStreamReader:
> "+str(process.memory_info().rss))
> print("Memory before read_all: "+str(process.memory_info().rss))
> z = reader.read_all()
> print("Memory after read_all: "+str(process.memory_info().rss))
> startm = process.memory_info().rss
> print("Memory before to_pandas: "+str(startm))
> start=time.time()
> df = z.to_pandas(zero_copy_only=True)
> dt=time.time()-start
> df_size = df.memory_usage().sum()
> endm = process.memory_info().rss
> print("Memory after to_pandas: "+str(endm))
> print("Difference in memory usage after call to to_pandas:
> "+str(endm-startm))
> print("Converted %d byte arrow table to %d byte dataframe at a rate of %f
> bytes/sec" % (b.size,df_size,(b.size/dt)))



Here is the output:

(karrow) ➜  test git:(master) ✗ python3 -i ./test.py
> Memory before memory_map: 72962048
> Memory after memory_map: 72962048
> Memory before reading_buffer: 72962048
> Memory after reading_buffer: 72962048
> buffer size: 2000000476
> Memory before RecordBatchStreamReader: 72962048
> Memory after RecordBatchStreamReader: 72962048
> Memory before read_all: 72962048
> Memory after read_all: 72962048
> Memory before to_pandas: 72962048
> Memory after to_pandas: 4074319872
> Difference in memory usage after call to to_pandas: 4001357824
> Converted 2000000476 byte arrow table to 2000000080 byte dataframe at a rate of 2615333819.434341 bytes/sec


The Apache Arrow table consists of 3 columns:

>>> z
> pyarrow.Table
> int: int32
> float: double
> bigint: int64


The output dataframe has these types, which look reasonable:

>>> df.ftypes
> int         int32:dense
> float     float64:dense
> bigint      int64:dense
> dtype: object
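
One crude check I may also run in the same python3 -i session is to see
whether the column arrays actually own their memory (just an idea, and
OWNDATA alone is not conclusive, but a zero-copy conversion should yield
views over Arrow-owned buffers rather than fresh allocations):

> # Hypothetical diagnostic: a zero-copy column should report
> # OWNDATA False and have a .base that refers back to Arrow-owned
> # memory rather than being a standalone copy.
> for col in df.columns:
>     arr = df[col].values
>     print(col, "owndata:", arr.flags['OWNDATA'],
>           "base:", type(arr.base).__name__)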


Over the weekend, I will try to put together a self-contained, reproducible
test case, but I thought this might already give some insight into whether
there is an issue and, if so, what it may be. One thing that stands out:
the RSS increase from to_pandas (~4.0 GB) is roughly double the 2.0 GB
Arrow buffer size.
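
In the meantime, here is roughly how I generated /dev/shm/arrow_table.
This is a sketch reconstructed from memory, so treat the row count and
the random data as approximations rather than the exact script I used:

> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> # ~100 million rows of (int32, float64, int64) is about 20 bytes per
> # row, i.e. roughly the 2 GB buffer size seen in the output above.
> n = 100 * 1000 * 1000
> df = pd.DataFrame({'int': np.arange(n, dtype=np.int32),
>                    'float': np.random.rand(n),
>                    'bigint': np.arange(n, dtype=np.int64)},
>                   columns=['int', 'float', 'bigint'])
> table = pa.Table.from_pandas(df, preserve_index=False)
>
> # Write the table as a record batch stream into shared memory.
> sink = pa.OSFile('/dev/shm/arrow_table', 'wb')
> writer = pa.RecordBatchStreamWriter(sink, table.schema)
> writer.write_table(table)
> writer.close()
> sink.close()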

Regards,

Bipin


On Fri, Sep 28, 2018 at 5:41 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Abdul -- Plasma vs. a memory map on /dev/shm should have the same
> semantics re: memory copying, so I don't believe using Plasma will
> change the outcome
>
> - Wes
> On Fri, Sep 28, 2018 at 5:38 PM Abdul Rahman <abdulrahman...@outlook.com>
> wrote:
> >
> > Have you tried using Plasma, which is effectively what you are trying
> > to do?
> >
> >
> > https://arrow.apache.org/docs/python/plasma.html#using-arrow-and-pandas-with-plasma
> >
> >
> > ________________________________
> > From: Bipin Mathew <bipinmat...@gmail.com>
> > Sent: Friday, September 28, 2018 2:28:54 PM
> > To: dev@arrow.apache.org
> > Subject: Help with zero-copy conversion of pyarrow table to pandas
> > dataframe.
> >
> > Hello Everyone,
> >
> >      I am just getting my feet wet with Apache Arrow and I am running
> > into a bug or, more likely, simply misunderstanding the pyarrow API. I
> > wrote out a four column, million row Apache Arrow table to shared
> > memory and I am attempting to read it into a pandas dataframe. It is
> > advertised that it is possible to do this in a zero-copy manner;
> > however, when I run the to_pandas() method on the table I imported
> > into pyarrow, my memory usage increases, indicating that it did not
> > actually do a zero-copy conversion. Here is my code:
> >
> > import pyarrow as pa
> > import pandas as pd
> > import numpy as np
> > import time
> >
> > start = time.time()
> > mm = pa.memory_map('/dev/shm/arrow_table')
> > b = mm.read_buffer()
> > reader = pa.RecordBatchStreamReader(b)
> > z = reader.read_all()
> > print("reading time: "+str(time.time()-start))
> >
> > start = time.time()
> > df = z.to_pandas(zero_copy_only=True, use_threads=True)
> > print("conversion time: "+str(time.time()-start))
> >
> >
> > What am I doing wrong here? Or, indeed, am I simply misunderstanding
> > what is meant by zero-copy in this context? My frantic Google searching
> > only turned up this possibly relevant issue, but it was unclear to me
> > how it was resolved:
> >
> > https://github.com/apache/arrow/issues/1649
> >
> > I am using pyarrow 0.10.0.
> >
> > Regards,
> >
> > Bipin
>
