Antoine,

That's a good question. I think there's a critical part that I haven't
articulated well in the doc yet.

When converting from Arrow's columnar format to rows, you have three
options:

(1) Go through the record batch row by row.
(2) Iterate through each column of the record batch, adding that column's
value to each row.
(3) Iterate through smaller sub-batches of the record batch, and do (2) on
each sub-batch.
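
To make the difference between (1) and (2) concrete, here is a rough sketch
that uses plain std::vector columns rather than Arrow types. ToyBatch,
ToyRow, RowByRow() and ColumnByColumn() are hypothetical names made up for
this illustration; none of them are part of Arrow or of the proposed helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a record batch: two columns of equal length.
struct ToyBatch {
  std::vector<std::int64_t> ids;
  std::vector<std::string> names;
};

// Hypothetical row type matching the two columns above.
struct ToyRow {
  std::int64_t id;
  std::string name;
};

// Option (1): build one row at a time, touching every column for each row.
std::vector<ToyRow> RowByRow(const ToyBatch& batch) {
  assert(batch.ids.size() == batch.names.size());
  std::vector<ToyRow> rows;
  rows.reserve(batch.ids.size());
  for (std::size_t i = 0; i < batch.ids.size(); ++i) {
    rows.push_back(ToyRow{batch.ids[i], batch.names[i]});
  }
  return rows;
}

// Option (2): allocate all rows first, then fill them one column at a time.
std::vector<ToyRow> ColumnByColumn(const ToyBatch& batch) {
  assert(batch.ids.size() == batch.names.size());
  std::vector<ToyRow> rows(batch.ids.size());
  for (std::size_t i = 0; i < batch.ids.size(); ++i) {
    rows[i].id = batch.ids[i];
  }
  for (std::size_t i = 0; i < batch.names.size(); ++i) {
    rows[i].name = batch.names[i];
  }
  return rows;
}
```

Both functions produce the same rows; they differ only in traversal order.
Option (3) would apply the column-by-column loop to successive sub-ranges of
the batch instead of to the whole batch at once.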

The converter would do (3). In the cases I've heard of, it seems to be the
most performant, though I would welcome others' opinions on that. I imagine
there are some memory-locality benefits, though I am no expert on that.

This is most apparent when you look at the following two methods:

template <typename T>
class ToRowConverter {
 public:
  // Implemented by the subclass
  virtual arrow::Result<std::vector<T>> Convert(
      std::shared_ptr<arrow::RecordBatch> batch) = 0;
  // Derived: converts the batch in slices of batch_size rows
  arrow::Result<std::vector<T>> RecordBatchToRows(
      std::shared_ptr<arrow::RecordBatch> batch, size_t batch_size);
};

The idea here is that RecordBatchToRows() will convert in smaller slices
dictated by batch_size. A record batch with 2 million rows might be
converted 10,000 rows at a time.
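
As a rough sketch of that slicing loop, here is a version that does not
depend on Arrow at all. IntColumnBatch is a hypothetical stand-in for
arrow::RecordBatch (its Slice() copies, whereas Arrow slices are zero-copy),
and ToyRowConverter, BatchToRows(), IntRow and IntRowConverter are likewise
made-up names, so please read this as an illustration of the idea rather
than as the proposed implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a record batch with a single int column.
struct IntColumnBatch {
  std::vector<int> values;

  std::size_t num_rows() const { return values.size(); }

  // Returns rows [offset, offset + length). This toy version copies.
  IntColumnBatch Slice(std::size_t offset, std::size_t length) const {
    auto first = values.begin() + static_cast<std::ptrdiff_t>(offset);
    auto last = first + static_cast<std::ptrdiff_t>(length);
    return IntColumnBatch{std::vector<int>(first, last)};
  }
};

template <typename T>
class ToyRowConverter {
 public:
  virtual ~ToyRowConverter() = default;

  // Implemented by the subclass: converts one (small) batch to rows.
  virtual std::vector<T> Convert(const IntColumnBatch& batch) = 0;

  // Derived: calls Convert() on slices of at most batch_size rows.
  std::vector<T> BatchToRows(const IntColumnBatch& batch,
                             std::size_t batch_size) {
    assert(batch_size > 0);
    std::vector<T> rows;
    rows.reserve(batch.num_rows());
    for (std::size_t offset = 0; offset < batch.num_rows();
         offset += batch_size) {
      std::size_t length = std::min(batch_size, batch.num_rows() - offset);
      std::vector<T> chunk = Convert(batch.Slice(offset, length));
      rows.insert(rows.end(), std::make_move_iterator(chunk.begin()),
                  std::make_move_iterator(chunk.end()));
    }
    return rows;
  }
};

// Hypothetical row type and subclass, used only to exercise the loop.
struct IntRow {
  int value;
};

class IntRowConverter : public ToyRowConverter<IntRow> {
 public:
  std::size_t calls = 0;  // Counts how many slices were converted.

  std::vector<IntRow> Convert(const IntColumnBatch& batch) override {
    ++calls;
    std::vector<IntRow> rows;
    rows.reserve(batch.num_rows());
    for (int v : batch.values) {
      rows.push_back(IntRow{v});
    }
    return rows;
  }
};
```

With a 25-row batch and a batch_size of 10, Convert() would be called on
slices of 10, 10 and 5 rows, and the results concatenated in order.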

I'm going to update the doc to make that clearer, but does what I described
above seem sensible?

Best,
Will Jones



On Thu, Mar 24, 2022 at 9:47 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello Will,
>
> So the added value would simply be the automatic definition of
> iterator-returning methods? Or am I missing something?
>
> Regards
>
> Antoine.
>
>
> Le 23/03/2022 à 19:36, Will Jones a écrit :
> > Hello Arrow devs,
> >
> > I recently created ARROW-16006 [1] ("Helpers for converting between rows
> > and Arrow objects"), and would appreciate feedback. It's meant for
> > conversion from arbitrary schemas, whereas the existing C++ examples
> > demonstrate fixed schemas (that is, known at compile-time).
> >
> > If you have implemented conversion between Arrow and a row-based data
> > structures in C++ (or tried to): Would these helpers work for your use
> > case? There is an associated draft design doc linked in the issue [2],
> > which is open to comments.
> >
> > Thanks,
> >
> > Will Jones
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-16006
> > [2]
> >
> https://docs.google.com/document/d/174tldmQLMCvOtjxGtFPeoLBefyE1x26_xntwfSzDXFA/edit?usp=sharing
> >
>
