Good Morning Everyone,

      I have not yet had an opportunity to write a reproducible test case
for this issue, but I am hoping the generous people here can help me with a
more general question: how, fundamentally, are we expected to copy, or
indeed directly write, an Arrow table to shared memory using the C++ SDK?
Currently, I have an implementation like this:

 77   std::shared_ptr<arrow::Buffer> B;
 78   std::shared_ptr<arrow::io::BufferOutputStream> buffer;
 79   std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
 80   arrow::MemoryPool* pool = arrow::default_memory_pool();
 81   arrow::io::BufferOutputStream::Create(4096, pool, &buffer);
 82   std::shared_ptr<arrow::Table> table;
 83   karrow::ArrowHandle *h;
 84   h = (karrow::ArrowHandle *)Kj(khandle);
 85   table = h->table;
 86
 87   arrow::ipc::RecordBatchStreamWriter::Open(buffer.get(), table->schema(), &writer);
 88   writer->WriteTable(*table);
 89   writer->Close();
 90   buffer->Finish(&B);
 91
 92   // printf("Investigate Memory usage.");
 93   // getchar();
 94
 95
 96   std::shared_ptr<arrow::io::MemoryMappedFile> mm;
 97   arrow::io::MemoryMappedFile::Create("/dev/shm/arrow_table", B->size(), &mm);
 98   mm->Write(B->data(), B->size());
 99   mm->Close();


"table" on line 85 is a shared_ptr to a arrow::Table object. As you can see
there, I write to an arrow:Buffer then write that to a memory mapped file.
Is there a more direct approach? I watched this video of a talk @Wes
McKinney gave here:

https://www.dremio.com/webinars/arrow-c++-roadmap-and-pandas2/

in which a method, arrow::MemoryMappedBuffer, was referenced, but I have
not seen any documentation for that function. Has it been deprecated?

Also, as I mentioned, "table" above is an arrow::Table object, which I
build column-wise using the various arrow::[type]Builder classes. Is there
any way to write the original table directly into shared memory, without
the intermediate serialization step? Any guidance on the proper way to do
these things would be greatly appreciated.
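
For concreteness, here is the sort of thing I was imagining. This is an
untested sketch against the 0.10-era out-parameter APIs, and the
MockOutputStream pre-sizing pass is only my guess at how to satisfy
MemoryMappedFile::Create's up-front size requirement:

  // Untested sketch; Status return values ignored for brevity, as above.
  // First pass: count the IPC stream bytes without allocating them.
  arrow::io::MockOutputStream mock;
  std::shared_ptr<arrow::ipc::RecordBatchWriter> counter;
  arrow::ipc::RecordBatchStreamWriter::Open(&mock, table->schema(), &counter);
  counter->WriteTable(*table);
  counter->Close();
  int64_t stream_size = mock.GetExtentBytesWritten();

  // Second pass: create the shared-memory file at exactly that size and
  // stream the table into it directly, with no intermediate arrow::Buffer.
  std::shared_ptr<arrow::io::MemoryMappedFile> mm;
  arrow::io::MemoryMappedFile::Create("/dev/shm/arrow_table", stream_size, &mm);
  std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
  arrow::ipc::RecordBatchStreamWriter::Open(mm.get(), table->schema(), &writer);
  writer->WriteTable(*table);
  writer->Close();
  mm->Close();

Is something along these lines the intended pattern, or is there a built-in
way to size and fill the map in one step?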

Regards,

Bipin


On Fri, Sep 28, 2018 at 11:20 PM Bipin Mathew <bipinmat...@gmail.com> wrote:

> Good Evening Abdul, Wes,
>
>      @Abdul, I agree I could probably use Plasma, but I just wanted to get
> something up and running quickly for prototyping purposes. As @Wes
> mentioned, I will probably run into the same thing using Plasma. I managed
> to get a little more debugging output. Here is the script that I am running:
>
> (karrow) ➜  test git:(master) ✗ cat test.py
>
>
>> import pyarrow as pa
>> import pandas as pd
>> import numpy as np
>> import time
>> import os
>> import psutil
>>
>> process = psutil.Process(os.getpid())
>> print("Memory before memory_map: "+str(process.memory_info().rss))
>> mm=pa.memory_map('/dev/shm/arrow_table','r')
>> print("Memory after before memory_map: "+str(process.memory_info().rss))
>>
>> print("Memory before reading_buffer: "+str(process.memory_info().rss))
>> b=mm.read_buffer()
>> print("Memory after reading_buffer: "+str(process.memory_info().rss))
>> print("buffer size: "+str(b.size))
>>
>> print("Memory before RecordBatchStreamReader:
>> "+str(process.memory_info().rss))
>> reader = pa.RecordBatchStreamReader(b)
>> print("Memory after RecordBatchStreamReader:
>> "+str(process.memory_info().rss))
>> print("Memory before read_all: "+str(process.memory_info().rss))
>> z = reader.read_all()
>> print("Memory after read_all: "+str(process.memory_info().rss))
>> startm = process.memory_info().rss
>> print("Memory before to_pandas: "+str(startm))
>> start=time.time()
>> df = z.to_pandas(zero_copy_only=True)
>> dt=time.time()-start
>> df_size = df.memory_usage().sum()
>> endm = process.memory_info().rss
>> print("Memory after to_pandas: "+str(endm))
>> print("Difference in memory usage after call to to_pandas:
>> "+str(endm-startm))
>> print("Converted %d byte arrow table to %d byte dataframe at a rate of %f
>> bytes/sec" % (b.size,df_size,(b.size/dt)))
>
>
>
> Here is the output:
>
> (karrow) ➜  test git:(master) ✗ python3 -i ./test.py
>> Memory before memory_map: 72962048
>> Memory after memory_map: 72962048
>> Memory before reading_buffer: 72962048
>> Memory after reading_buffer: 72962048
>> buffer size: 2000000476
>> Memory before RecordBatchStreamReader: 72962048
>> Memory after RecordBatchStreamReader: 72962048
>> Memory before read_all: 72962048
>> Memory after read_all: 72962048
>> Memory before to_pandas: 72962048
>> Memory after to_pandas: 4074319872
>> Difference in memory usage after call to to_pandas: 4001357824
>> Converted 2000000476 byte arrow table to 2000000080 byte dataframe at a rate of 2615333819.434341 bytes/sec
>
>
> The Apache Arrow table comprises three columns:
>
> >>> z
>> pyarrow.Table
>> int: int32
>> float: double
>> bigint: int64
>
>
> The output dataframe has these types, which look reasonable:
>
> >>> df.ftypes
>> int         int32:dense
>> float     float64:dense
>> bigint      int64:dense
>> dtype: object
>
>
> Over the weekend, I will try to write a self-contained, reproducible test
> case, but I thought this might start to give some insight into whether
> there is an issue and, if so, what it may be.
>
> Regards,
>
> Bipin
>
>
> On Fri, Sep 28, 2018 at 5:41 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Abdul -- Plasma vs. a memory map on /dev/shm should have the same
>> semantics re: memory copying, so I don't believe using Plasma will
>> change the outcome
>>
>> - Wes
>> On Fri, Sep 28, 2018 at 5:38 PM Abdul Rahman <abdulrahman...@outlook.com>
>> wrote:
>> >
>> > Have you tried using Plasma, which is effectively what you are trying
>> > to do?
>> >
>> >
>> > https://arrow.apache.org/docs/python/plasma.html#using-arrow-and-pandas-with-plasma
>> >
>> >
>> > ________________________________
>> > From: Bipin Mathew <bipinmat...@gmail.com>
>> > Sent: Friday, September 28, 2018 2:28:54 PM
>> > To: dev@arrow.apache.org
>> > Subject: Help with zero-copy conversion of pyarrow table to pandas dataframe.
>> >
>> > Hello Everyone,
>> >
>> >      I am just getting my feet wet with Apache Arrow, and I am running
>> > into a bug or, more likely, simply misunderstanding the pyarrow API. I
>> > wrote out a four-column, million-row Apache Arrow table to shared
>> > memory, and I am attempting to read it into a pandas dataframe. It is
>> > advertised that it is possible to do this in a zero-copy manner;
>> > however, when I run the to_pandas() method on the table I imported into
>> > pyarrow, my memory usage increases, indicating that it did not actually
>> > do a zero-copy conversion. Here is my code:
>> >
>> > > import pyarrow as pa
>> > > import pandas as pd
>> > > import numpy as np
>> > > import time
>> > >
>> > > start = time.time()
>> > > mm = pa.memory_map('/dev/shm/arrow_table')
>> > > b = mm.read_buffer()
>> > > reader = pa.RecordBatchStreamReader(b)
>> > > z = reader.read_all()
>> > > print("reading time: "+str(time.time()-start))
>> > >
>> > > start = time.time()
>> > > df = z.to_pandas(zero_copy_only=True,use_threads=True)
>> > > print("conversion time: "+str(time.time()-start))
>> >
>> >
>> > What am I doing wrong here? Or am I simply misunderstanding what is
>> > meant by zero-copy in this context? My frantic Google efforts only
>> > turned up this possibly relevant issue, but it was unclear to me how it
>> > was resolved:
>> >
>> > https://github.com/apache/arrow/issues/1649
>> >
>> > I am using pyarrow 0.10.0.
>> >
>> > Regards,
>> >
>> > Bipin
>>
>
