For those interested, the PR for this new API is ready for review here: https://github.com/apache/arrow/pull/12775
On Wed, Apr 6, 2022 at 11:17 AM Will Jones <will.jones...@gmail.com> wrote:

> Hello,
>
> I've fleshed out the ideas in the doc in this draft PR:
> https://github.com/apache/arrow/pull/12775
>
> Feedback on the API design is still welcome.
>
> Best,
>
> Will Jones
>
> On Thu, Mar 24, 2022 at 10:25 AM Will Jones <will.jones...@gmail.com>
> wrote:
>
>> Antoine,
>>
>> That's a good question. I think there's a critical part that I haven't
>> articulated well in the doc yet.
>>
>> When converting from Arrow's columnar format to rows, you have three
>> options:
>>
>> (1) Go through the record batch row by row
>> (2) Iterate through each column of the record batch, adding the column
>> value to each row
>> (3) Iterate through smaller sub-batches of the record batch, and do (2)
>> on each sub-batch
>>
>> The converter would do (3). In the cases I've heard of, this seems to be
>> the most performant, though I would welcome others' opinions on that. I
>> imagine there are some "memory locality" benefits, though I am no expert
>> on that.
>>
>> This is most apparent when you look at the following two methods:
>>
>> template <typename T>
>> class ToRowConverter {
>>  public:
>>   // Implemented by the subclass: converts one batch in a single pass.
>>   virtual arrow::Result<std::vector<T>> Convert(
>>       std::shared_ptr<arrow::RecordBatch> batch) = 0;
>>
>>   // Derived from Convert(): converts in slices of at most batch_size rows.
>>   arrow::Result<std::vector<T>> RecordBatchToRows(
>>       std::shared_ptr<arrow::RecordBatch> batch, size_t batch_size);
>> };
>>
>> The idea here is that RecordBatchToRows() will convert in smaller slices
>> dictated by batch_size. A record batch with 2 million rows might be
>> converted 10,000 rows at a time.
>>
>> I'm going to update the doc to make that clearer, but does what I
>> described above seem sensible?
>>
>> Best,
>> Will Jones
>>
>> On Thu, Mar 24, 2022 at 9:47 AM Antoine Pitrou <anto...@python.org>
>> wrote:
>>
>>> Hello Will,
>>>
>>> So the added value would simply be the automatic definition of
>>> iterator-returning methods? Or am I missing something?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On 23/03/2022 at 19:36, Will Jones wrote:
>>> > Hello Arrow devs,
>>> >
>>> > I recently created ARROW-16006 [1] ("Helpers for converting between rows
>>> > and Arrow objects"), and would appreciate feedback. It's meant for
>>> > conversion from arbitrary schemas, whereas the existing C++ examples
>>> > demonstrate fixed schemas (that is, known at compile time).
>>> >
>>> > If you have implemented conversion between Arrow and row-based data
>>> > structures in C++ (or tried to): Would these helpers work for your use
>>> > case? There is an associated draft design doc linked in the issue [2],
>>> > which is open to comments.
>>> >
>>> > Thanks,
>>> >
>>> > Will Jones
>>> >
>>> > [1] https://issues.apache.org/jira/browse/ARROW-16006
>>> > [2] https://docs.google.com/document/d/174tldmQLMCvOtjxGtFPeoLBefyE1x26_xntwfSzDXFA/edit?usp=sharing
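
For readers skimming the thread, here is a rough, self-contained sketch of option (3) from the March 24 message: walk the batch in slices of at most batch_size rows and fill each slice column by column. It deliberately does not use Arrow. ColumnarBatch, Row, ConvertSlice and BatchToRows are hypothetical stand-ins invented for this illustration, with a fixed two-column schema, and are not the API proposed in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for a record batch: one std::vector per column.
// A real converter would read from arrow::RecordBatch instead.
struct ColumnarBatch {
  std::vector<int64_t> ids;
  std::vector<std::string> names;
  size_t num_rows() const { return ids.size(); }
};

// Hypothetical row type matching the two columns above.
struct Row {
  int64_t id;
  std::string name;
};

// Option (2), restricted to rows [offset, offset + length):
// visit one column at a time and write its values into the rows.
std::vector<Row> ConvertSlice(const ColumnarBatch& batch, size_t offset,
                              size_t length) {
  std::vector<Row> rows(length);
  for (size_t i = 0; i < length; ++i) rows[i].id = batch.ids[offset + i];
  for (size_t i = 0; i < length; ++i) rows[i].name = batch.names[offset + i];
  return rows;
}

// Option (3): apply ConvertSlice to consecutive sub-batches of at most
// batch_size rows and concatenate the results in order.
std::vector<Row> BatchToRows(const ColumnarBatch& batch, size_t batch_size) {
  assert(batch_size > 0);
  assert(batch.ids.size() == batch.names.size());
  std::vector<Row> out;
  out.reserve(batch.num_rows());
  for (size_t offset = 0; offset < batch.num_rows(); offset += batch_size) {
    const size_t length = std::min(batch_size, batch.num_rows() - offset);
    std::vector<Row> chunk = ConvertSlice(batch, offset, length);
    out.insert(out.end(), std::make_move_iterator(chunk.begin()),
               std::make_move_iterator(chunk.end()));
  }
  return out;
}
```

Calling BatchToRows(batch, 10000) on a batch with 2 million rows would run ConvertSlice 200 times. Whether that actually beats a single column-by-column pass depends on the row type and the hardware, so it is worth benchmarking rather than assuming.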