I don't think there's any benefit in keeping RecordBatch abstract. Making it concrete would probably reduce overhead slightly as well.
People who need abstract tabular data capabilities can still implement Table (though I wonder if that ability has ever been used productively). Regards Antoine. Le 19/02/2021 à 17:17, Benjamin Kietzman a écrit : > Thanks for looking into this, Micah. > > One convention I'd like to append: we mostly avoid convenience typedefs, > but an > exception is the common case of `vector<shared_ptr<{class name}>>`, for > which we > allow and encourage the use of `{class name}Vector` typedefs. (Conversely, > nothing > should be named /^\w+Vector$/ which isn't a vector-of-shared_ptr typedef.) > > 1. Agreed on Array::data(), in fact I made this change in #9490 (ARROW-9196) > > 2. Is it worthwhile to keep RecordBatch abstract? I am only aware of a > single concrete > subclass (SimpleRecordBatch). I agree that RecordBatch's assessors > should avoid > unnecessary construction of shared_ptrs, and demoting it to a concrete > class would > make it clear that this is safe. > > Ben > > On Mon, Feb 15, 2021 at 11:55 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> (Apologies if this is a double send) >> >> I'll open a PR on this soon. To update the dev guide. >> >> Given this standard there are few accessor methods that I think we should >> either convert or create a new accessor that does the correct thing with >> respect to return type. Given how core these methods are I think the >> latter might be a better approach (but I don't feel too strongly if others >> have a good rationale one way or another): >> 1. Array::Data() [1] - In looking at some CPU profiles it seems like most >> of the time spent in Validate is due to shared_ptr >> construction/destruction. In auditing the code this method appears to be >> the only one returning copies. >> >> 2. RecordBatch::Column* [2] - These are more questionable since they are >> virtual methods, it is not clear if dynamic Record batches where the >> intention behind this design. So it might not be worth it. Anecdotally, >> I've known people who have written some naive iteration code use these >> methods where shared_ptr construction/destruction contributed 10% overhead. >> >> Thanks, >> Micah >> >> >> >> [1] >> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_base.h#L163 >> [2] >> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L98 >> >> On Mon, Feb 8, 2021 at 10:09 AM Wes McKinney <wesmck...@gmail.com> wrote: >> >>> Agreed. We should probably document this in the C++ developer docs. >>> >>> On Mon, Feb 8, 2021 at 12:04 PM Antoine Pitrou <anto...@python.org> >> wrote: >>>> >>>> >>>> Hi Micah, >>>> >>>> That's roughly my mental model as well. >>>> >>>> However, for 4) I would say that return a const ref to shared_ptr if >>>> preferable because the caller will often need the ownership (especially >>>> with Array, ArrayData, DataType, etc.). >>>> >>>> Regards >>>> >>>> Antoine. >>>> >>>> >>>> Le 08/02/2021 à 18:02, Micah Kornfield a écrit : >>>>> I'm not sure how consistent we are with how shared_ptr is used as a >>>>> parameter to methods and as a return type. In reviewing and writing >>> code >>>>> I've been using these guidelines for myself and I was wondering if >> they >>>>> align with others: >>>>> >>>>> 1. If a copy of a shared_ptr is not intended to be made by the >> method >>> then >>>>> use a const ref to underlying type. i.e. void Foo(const Array& >> array) >>> is >>>>> preferable to void Foo(const shared_ptr<Array>& array) [1]. >>>>> >>>>> 2. If a copy is always going to be made pass by value. i.e. void >>>>> Foo(std::shared_ptr<Array>) and to std::move within the method. The >>> last >>>>> time I did research on this allowed for eliminating shared_ptr >>> overhead if >>>>> the caller also can std::move() the parameter. >>>>> >>>>> 3. If a copy might be made pass the shared_ptr by const reference. >>> i.e. void >>>>> Foo(const shared_ptr<T>& array) The exception to this if the contents >>> of >>>>> the shared_ptr a reference can effectively be copied cheaply without >>> as is >>>>> the case with Array via ArrayData in which case #1 applies. >>>>> >>>>> 4. For accessor methods prefer returning by const ref or underlying >>> ref to >>>>> underlying when appropriate. i.e. const std::shared_ptr<Array>& >>> foo() or >>>>> const Array& Foo(). >>>>> >>>>> 5. For factory like methods return a copy i.e. std::shared_ptr<Array> >>>>> MakeFoo(); >>>>> >>>>> Is this other people's mental model? I'd like to update our style >>> guide so >>>>> we can hopefully drive consistency over time. >>>>> >>>>> Thanks, >>>>> Micah >>>>> >>>>> [1] Array is somewhat of a special case because one can have >>> essentially >>>>> the same shared_ptr copy semantics by copying the underlying >> ArrayData >>>>> object. >>>>> >>> >> >