On Fri, Feb 19, 2021 at 10:30 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> >
> > Agreed on Array::data(), in fact I made this change in #9490 (ARROW-9196)
>
> Thank you!
>
>
> > One convention I'd like to append: we mostly avoid convenience typedefs,
> > but an
> > exception is the common case of `vector<shared_ptr<{class name}>>`, for
> > which we
> > allow and encourage the use of `{class name}Vector` typedefs. (Conversely,
> > nothing
> > should be named /^\w+Vector$/ which isn't a vector-of-shared_ptr typedef.)
>
> I'm -0 on this.  I think it makes the code base slightly less readable for
> new-comers.
>
> People who need abstract tabular data capabilities can still implement
> > Table (though I wonder if that ability has ever been used productively).
>
> I think there is an open PR proposing a dataframe like wrapper around table
> or record batch (I would need to go back and look).  I would have to go
> back and look to see if it made use of either of these.

On this I would caution against preemptively closing doors — this
codebase still has a lot of room to grow, and given how early we still
are on the analytics / query processing / data frames side of the
project I would suggest waiting until there has been an opportunity
for progress to happen here and we see whether these abstract
interfaces are actually needed. For example, if a Table / RecordBatch
façade is provided for lazy/unevaluated operations, that could be
really valuable in practice.

I regret that haven't been able to do more myself to move that forward
but I trust we will get there eventually.

> -Micah
>
>
>
> On Fri, Feb 19, 2021 at 8:25 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > I don't think there's any benefit in keeping RecordBatch abstract.
> > Making it concrete would probably reduce overhead slightly as well.
> >
> > People who need abstract tabular data capabilities can still implement
> > Table (though I wonder if that ability has ever been used productively).
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 19/02/2021 à 17:17, Benjamin Kietzman a écrit :
> > > Thanks for looking into this, Micah.
> > >
> > > One convention I'd like to append: we mostly avoid convenience typedefs,
> > > but an
> > > exception is the common case of `vector<shared_ptr<{class name}>>`, for
> > > which we
> > > allow and encourage the use of `{class name}Vector` typedefs.
> > (Conversely,
> > > nothing
> > > should be named /^\w+Vector$/ which isn't a vector-of-shared_ptr
> > typedef.)
> > >
> > > 1. Agreed on Array::data(), in fact I made this change in #9490
> > (ARROW-9196)
> > >
> > > 2. Is it worthwhile to keep RecordBatch abstract? I am only aware of a
> > > single concrete
> > >    subclass (SimpleRecordBatch). I agree that RecordBatch's assessors
> > > should avoid
> > >    unnecessary construction of shared_ptrs, and demoting it to a concrete
> > > class would
> > >    make it clear that this is safe.
> > >
> > > Ben
> > >
> > > On Mon, Feb 15, 2021 at 11:55 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> (Apologies if this is a double send)
> > >>
> > >> I'll open a PR on this soon. To update the dev guide.
> > >>
> > >> Given this standard there are few accessor methods that I think we
> > should
> > >> either convert or create a new accessor that does the correct thing with
> > >> respect to return type.  Given how core these methods are I think the
> > >> latter might be a better approach (but I don't feel too strongly if
> > others
> > >> have a good rationale one way or another):
> > >> 1. Array::Data() [1] - In looking at some CPU profiles it seems like
> > most
> > >> of the time spent in Validate is due to shared_ptr
> > >> construction/destruction.  In auditing the code this method appears to
> > be
> > >> the only one returning copies.
> > >>
> > >> 2. RecordBatch::Column* [2] - These are more questionable since they are
> > >> virtual methods, it is not clear if dynamic Record batches where the
> > >> intention behind this design. So it might not be worth it.  Anecdotally,
> > >> I've known people who have written some naive iteration code use these
> > >> methods where shared_ptr construction/destruction contributed 10%
> > overhead.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >>
> > >>
> > >> [1]
> > >>
> > >>
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_base.h#L163
> > >> [2]
> > >>
> > >>
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L98
> > >>
> > >> On Mon, Feb 8, 2021 at 10:09 AM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >>
> > >>> Agreed. We should probably document this in the C++ developer docs.
> > >>>
> > >>> On Mon, Feb 8, 2021 at 12:04 PM Antoine Pitrou <anto...@python.org>
> > >> wrote:
> > >>>>
> > >>>>
> > >>>> Hi Micah,
> > >>>>
> > >>>> That's roughly my mental model as well.
> > >>>>
> > >>>> However, for 4) I would say that return a const ref to shared_ptr if
> > >>>> preferable because the caller will often need the ownership
> > (especially
> > >>>> with Array, ArrayData, DataType, etc.).
> > >>>>
> > >>>> Regards
> > >>>>
> > >>>> Antoine.
> > >>>>
> > >>>>
> > >>>> Le 08/02/2021 à 18:02, Micah Kornfield a écrit :
> > >>>>> I'm not sure how consistent we are with how shared_ptr is used as a
> > >>>>> parameter to methods and as a return type.  In reviewing and writing
> > >>> code
> > >>>>> I've been using these guidelines for myself and I was wondering if
> > >> they
> > >>>>> align with others:
> > >>>>>
> > >>>>> 1.  If a copy of a shared_ptr is not intended to be made by the
> > >> method
> > >>> then
> > >>>>> use a const ref to underlying type.  i.e. void Foo(const Array&
> > >> array)
> > >>> is
> > >>>>> preferable to void Foo(const shared_ptr<Array>& array) [1].
> > >>>>>
> > >>>>> 2.  If a copy is always going to be made pass by value.  i.e. void
> > >>>>> Foo(std::shared_ptr<Array>) and to std::move within the method.  The
> > >>> last
> > >>>>> time I did research on this allowed for eliminating shared_ptr
> > >>> overhead if
> > >>>>> the caller also can std::move() the parameter.
> > >>>>>
> > >>>>> 3.  If a copy might be made pass the shared_ptr by const reference.
> > >>> i.e. void
> > >>>>> Foo(const shared_ptr<T>& array) The exception to this if the contents
> > >>> of
> > >>>>> the shared_ptr a reference can effectively be copied cheaply without
> > >>> as is
> > >>>>> the case with Array via ArrayData in which case #1 applies.
> > >>>>>
> > >>>>> 4. For accessor methods prefer returning by const ref or underlying
> > >>> ref to
> > >>>>> underlying when appropriate.     i.e. const std::shared_ptr<Array>&
> > >>> foo()  or
> > >>>>> const Array& Foo().
> > >>>>>
> > >>>>> 5. For factory like methods return a copy i.e. std::shared_ptr<Array>
> > >>>>> MakeFoo();
> > >>>>>
> > >>>>> Is this other people's mental model?  I'd like to update our style
> > >>> guide so
> > >>>>> we can hopefully drive consistency over time.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Micah
> > >>>>>
> > >>>>> [1] Array is somewhat of a special case because one can have
> > >>> essentially
> > >>>>> the same shared_ptr copy semantics by copying the underlying
> > >> ArrayData
> > >>>>> object.
> > >>>>>
> > >>>
> > >>
> > >
> >

Reply via email to