For some reference timings from my system: I created a quick test that
dumps a ~1.7 GB table to buffer(s).

Going to many buffers (just collecting the buffers): ~11,000 ns (~11 µs)
Going to one preallocated buffer: ~160,000,000 ns (~160 ms)
Going to one dynamically allocated buffer (using a 2x growth factor):
~2,000,000,000 ns (~2 s)
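The gap between the last two cases is what repeated reallocation costs: every time a 2x-grown buffer overflows, everything written so far gets copied again. A minimal pure-Python sketch of that accounting (illustrative only — not the Arrow code path, and the sizes are made up):

```python
def bytes_copied(total_size: int, chunk: int, grow: bool) -> int:
    """Count bytes copied while appending total_size bytes in chunk-sized
    writes, into either a preallocated buffer or one grown 2x on overflow."""
    capacity = total_size if not grow else 1
    used = 0
    copied = 0
    while used < total_size:
        n = min(chunk, total_size - used)
        while used + n > capacity:   # grow: relocate everything written so far
            capacity *= 2
            copied += used
        used += n
        copied += n                  # the append itself
    return copied

size = 1 << 20  # appending 1 MiB in 4 KiB chunks
prealloc = bytes_copied(size, 4096, grow=False)
growing = bytes_copied(size, 4096, grow=True)
# The growing buffer copies roughly twice as many bytes as the preallocated one.
```

With geometric growth the relocations sum to about one extra copy of the whole payload, which is why preallocating (or reserving the exact size up front) wins.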

On Thu, Jun 10, 2021 at 11:46 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> To be clear, we would like to help make this faster. I don't recall
> much effort being invested in optimizing this code path in the last
> couple of years, so there may be some low hanging fruit to improve the
> performance. Changing the in-memory data layout (the chunking) is one
> of the most likely things to help.
>
> On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan <gosh...@gmail.com> wrote:
> >
> > Hi Jayjeet,
> >
> > I wonder if you really need to serialize the whole table into a single
> > buffer, as you will end up with twice the memory, when you could instead
> > send chunks as they are generated by the RecordBatchStreamWriter. Also,
> > is the buffer resized beforehand? I'd suspect there might be relocations
> > happening under the hood.
> >
> >
> > Cheers,
> > Gosh
> >
> > On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <wesmck...@gmail.com> wrote:
> >
> > > hi Jayjeet — have you run a profiler to see where those 1000ms are
> > > being spent? How many arrays (the sum of the number of chunks across
> > > all columns) are there in total? I would guess that the problem is all
> > > the little Buffer memcopies. I don't think that the C Data Interface
> > > is going to help you.
> > >
> > > - Wes
> > >
> > > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > > <jayjeetchakrabort...@gmail.com> wrote:
> > > >
> > > > Hello Arrow Community,
> > > >
> > > > I am a student working on a project where I need to serialize an
> > > in-memory Arrow Table of around 700 MB to a uint8_t* buffer. I am
> > > currently using the arrow::ipc::RecordBatchStreamWriter API to serialize
> > > the table to an arrow::Buffer, but it takes nearly 1000ms to serialize
> > > the whole table, which hurts my performance-critical application. I
> > > basically want to get hold of the underlying memory of the table as
> > > bytes and send it over the network. How do you suggest I tackle this
> > > problem? I was thinking of using the C Data Interface: converting my
> > > arrow::Table to ArrowArray and ArrowSchema structs and serializing those
> > > to send over the network, but it seems like serializing the structs is
> > > another complex problem of its own. It would be great to have some
> > > suggestions on this. Thanks a lot.
> > > >
> > > > Best,
> > > > Jayjeet
> > > >
> > >