On Fri, Feb 19, 2021 at 10:30 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > > > > > Agreed on Array::data(), in fact I made this change in #9490 (ARROW-9196) > > Thank you! > > > > One convention I'd like to append: we mostly avoid convenience typedefs, > > but an > > exception is the common case of `vector<shared_ptr<{class name}>>`, for > > which we > > allow and encourage the use of `{class name}Vector` typedefs. (Conversely, > > nothing > > should be named /^\w+Vector$/ which isn't a vector-of-shared_ptr typedef.) > > I'm -0 on this. I think it makes the code base slightly less readable for > new-comers. > > People who need abstract tabular data capabilities can still implement > > Table (though I wonder if that ability has ever been used productively). > > I think there is an open PR proposing a dataframe like wrapper around table > or record batch (I would need to go back and look). I would have to go > back and look to see if it made use of either of these.
On this I would caution against preemptively closing doors — this codebase still has a lot of room to grow, and given how early we still are on the analytics / query processing / data frames side of the project I would suggest waiting until there has been an opportunity for progress to happen here and we see whether these abstract interfaces are actually needed. For example, if a Table / RecordBatch façade is provided for lazy/unevaluated operations, that could be really valuable in practice. I regret that haven't been able to do more myself to move that forward but I trust we will get there eventually. > -Micah > > > > On Fri, Feb 19, 2021 at 8:25 AM Antoine Pitrou <anto...@python.org> wrote: > > > > > I don't think there's any benefit in keeping RecordBatch abstract. > > Making it concrete would probably reduce overhead slightly as well. > > > > People who need abstract tabular data capabilities can still implement > > Table (though I wonder if that ability has ever been used productively). > > > > Regards > > > > Antoine. > > > > > > Le 19/02/2021 à 17:17, Benjamin Kietzman a écrit : > > > Thanks for looking into this, Micah. > > > > > > One convention I'd like to append: we mostly avoid convenience typedefs, > > > but an > > > exception is the common case of `vector<shared_ptr<{class name}>>`, for > > > which we > > > allow and encourage the use of `{class name}Vector` typedefs. > > (Conversely, > > > nothing > > > should be named /^\w+Vector$/ which isn't a vector-of-shared_ptr > > typedef.) > > > > > > 1. Agreed on Array::data(), in fact I made this change in #9490 > > (ARROW-9196) > > > > > > 2. Is it worthwhile to keep RecordBatch abstract? I am only aware of a > > > single concrete > > > subclass (SimpleRecordBatch). I agree that RecordBatch's assessors > > > should avoid > > > unnecessary construction of shared_ptrs, and demoting it to a concrete > > > class would > > > make it clear that this is safe. > > > > > > Ben > > > > > > On Mon, Feb 15, 2021 at 11:55 PM Micah Kornfield <emkornfi...@gmail.com> > > > wrote: > > > > > >> (Apologies if this is a double send) > > >> > > >> I'll open a PR on this soon. To update the dev guide. > > >> > > >> Given this standard there are few accessor methods that I think we > > should > > >> either convert or create a new accessor that does the correct thing with > > >> respect to return type. Given how core these methods are I think the > > >> latter might be a better approach (but I don't feel too strongly if > > others > > >> have a good rationale one way or another): > > >> 1. Array::Data() [1] - In looking at some CPU profiles it seems like > > most > > >> of the time spent in Validate is due to shared_ptr > > >> construction/destruction. In auditing the code this method appears to > > be > > >> the only one returning copies. > > >> > > >> 2. RecordBatch::Column* [2] - These are more questionable since they are > > >> virtual methods, it is not clear if dynamic Record batches where the > > >> intention behind this design. So it might not be worth it. Anecdotally, > > >> I've known people who have written some naive iteration code use these > > >> methods where shared_ptr construction/destruction contributed 10% > > overhead. > > >> > > >> Thanks, > > >> Micah > > >> > > >> > > >> > > >> [1] > > >> > > >> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_base.h#L163 > > >> [2] > > >> > > >> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L98 > > >> > > >> On Mon, Feb 8, 2021 at 10:09 AM Wes McKinney <wesmck...@gmail.com> > > wrote: > > >> > > >>> Agreed. We should probably document this in the C++ developer docs. > > >>> > > >>> On Mon, Feb 8, 2021 at 12:04 PM Antoine Pitrou <anto...@python.org> > > >> wrote: > > >>>> > > >>>> > > >>>> Hi Micah, > > >>>> > > >>>> That's roughly my mental model as well. > > >>>> > > >>>> However, for 4) I would say that return a const ref to shared_ptr if > > >>>> preferable because the caller will often need the ownership > > (especially > > >>>> with Array, ArrayData, DataType, etc.). > > >>>> > > >>>> Regards > > >>>> > > >>>> Antoine. > > >>>> > > >>>> > > >>>> Le 08/02/2021 à 18:02, Micah Kornfield a écrit : > > >>>>> I'm not sure how consistent we are with how shared_ptr is used as a > > >>>>> parameter to methods and as a return type. In reviewing and writing > > >>> code > > >>>>> I've been using these guidelines for myself and I was wondering if > > >> they > > >>>>> align with others: > > >>>>> > > >>>>> 1. If a copy of a shared_ptr is not intended to be made by the > > >> method > > >>> then > > >>>>> use a const ref to underlying type. i.e. void Foo(const Array& > > >> array) > > >>> is > > >>>>> preferable to void Foo(const shared_ptr<Array>& array) [1]. > > >>>>> > > >>>>> 2. If a copy is always going to be made pass by value. i.e. void > > >>>>> Foo(std::shared_ptr<Array>) and to std::move within the method. The > > >>> last > > >>>>> time I did research on this allowed for eliminating shared_ptr > > >>> overhead if > > >>>>> the caller also can std::move() the parameter. > > >>>>> > > >>>>> 3. If a copy might be made pass the shared_ptr by const reference. > > >>> i.e. void > > >>>>> Foo(const shared_ptr<T>& array) The exception to this if the contents > > >>> of > > >>>>> the shared_ptr a reference can effectively be copied cheaply without > > >>> as is > > >>>>> the case with Array via ArrayData in which case #1 applies. > > >>>>> > > >>>>> 4. For accessor methods prefer returning by const ref or underlying > > >>> ref to > > >>>>> underlying when appropriate. i.e. const std::shared_ptr<Array>& > > >>> foo() or > > >>>>> const Array& Foo(). > > >>>>> > > >>>>> 5. For factory like methods return a copy i.e. std::shared_ptr<Array> > > >>>>> MakeFoo(); > > >>>>> > > >>>>> Is this other people's mental model? I'd like to update our style > > >>> guide so > > >>>>> we can hopefully drive consistency over time. > > >>>>> > > >>>>> Thanks, > > >>>>> Micah > > >>>>> > > >>>>> [1] Array is somewhat of a special case because one can have > > >>> essentially > > >>>>> the same shared_ptr copy semantics by copying the underlying > > >> ArrayData > > >>>>> object. > > >>>>> > > >>> > > >> > > > > >