I don't think there's any benefit in keeping RecordBatch abstract.
Making it concrete would probably reduce overhead slightly as well.

People who need abstract tabular data capabilities can still implement
Table (though I wonder if that ability has ever been used productively).

Regards

Antoine.


Le 19/02/2021 à 17:17, Benjamin Kietzman a écrit :
> Thanks for looking into this, Micah.
> 
> One convention I'd like to append: we mostly avoid convenience typedefs,
> but an
> exception is the common case of `vector<shared_ptr<{class name}>>`, for
> which we
> allow and encourage the use of `{class name}Vector` typedefs. (Conversely,
> nothing
> should be named /^\w+Vector$/ which isn't a vector-of-shared_ptr typedef.)
> 
> 1. Agreed on Array::data(), in fact I made this change in #9490 (ARROW-9196)
> 
> 2. Is it worthwhile to keep RecordBatch abstract? I am only aware of a
> single concrete
>    subclass (SimpleRecordBatch). I agree that RecordBatch's assessors
> should avoid
>    unnecessary construction of shared_ptrs, and demoting it to a concrete
> class would
>    make it clear that this is safe.
> 
> Ben
> 
> On Mon, Feb 15, 2021 at 11:55 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> 
>> (Apologies if this is a double send)
>>
>> I'll open a PR on this soon. To update the dev guide.
>>
>> Given this standard there are few accessor methods that I think we should
>> either convert or create a new accessor that does the correct thing with
>> respect to return type.  Given how core these methods are I think the
>> latter might be a better approach (but I don't feel too strongly if others
>> have a good rationale one way or another):
>> 1. Array::Data() [1] - In looking at some CPU profiles it seems like most
>> of the time spent in Validate is due to shared_ptr
>> construction/destruction.  In auditing the code this method appears to be
>> the only one returning copies.
>>
>> 2. RecordBatch::Column* [2] - These are more questionable since they are
>> virtual methods, it is not clear if dynamic Record batches where the
>> intention behind this design. So it might not be worth it.  Anecdotally,
>> I've known people who have written some naive iteration code use these
>> methods where shared_ptr construction/destruction contributed 10% overhead.
>>
>> Thanks,
>> Micah
>>
>>
>>
>> [1]
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_base.h#L163
>> [2]
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L98
>>
>> On Mon, Feb 8, 2021 at 10:09 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> Agreed. We should probably document this in the C++ developer docs.
>>>
>>> On Mon, Feb 8, 2021 at 12:04 PM Antoine Pitrou <anto...@python.org>
>> wrote:
>>>>
>>>>
>>>> Hi Micah,
>>>>
>>>> That's roughly my mental model as well.
>>>>
>>>> However, for 4) I would say that return a const ref to shared_ptr if
>>>> preferable because the caller will often need the ownership (especially
>>>> with Array, ArrayData, DataType, etc.).
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> Le 08/02/2021 à 18:02, Micah Kornfield a écrit :
>>>>> I'm not sure how consistent we are with how shared_ptr is used as a
>>>>> parameter to methods and as a return type.  In reviewing and writing
>>> code
>>>>> I've been using these guidelines for myself and I was wondering if
>> they
>>>>> align with others:
>>>>>
>>>>> 1.  If a copy of a shared_ptr is not intended to be made by the
>> method
>>> then
>>>>> use a const ref to underlying type.  i.e. void Foo(const Array&
>> array)
>>> is
>>>>> preferable to void Foo(const shared_ptr<Array>& array) [1].
>>>>>
>>>>> 2.  If a copy is always going to be made pass by value.  i.e. void
>>>>> Foo(std::shared_ptr<Array>) and to std::move within the method.  The
>>> last
>>>>> time I did research on this allowed for eliminating shared_ptr
>>> overhead if
>>>>> the caller also can std::move() the parameter.
>>>>>
>>>>> 3.  If a copy might be made pass the shared_ptr by const reference.
>>> i.e. void
>>>>> Foo(const shared_ptr<T>& array) The exception to this if the contents
>>> of
>>>>> the shared_ptr a reference can effectively be copied cheaply without
>>> as is
>>>>> the case with Array via ArrayData in which case #1 applies.
>>>>>
>>>>> 4. For accessor methods prefer returning by const ref or underlying
>>> ref to
>>>>> underlying when appropriate.     i.e. const std::shared_ptr<Array>&
>>> foo()  or
>>>>> const Array& Foo().
>>>>>
>>>>> 5. For factory like methods return a copy i.e. std::shared_ptr<Array>
>>>>> MakeFoo();
>>>>>
>>>>> Is this other people's mental model?  I'd like to update our style
>>> guide so
>>>>> we can hopefully drive consistency over time.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> [1] Array is somewhat of a special case because one can have
>>> essentially
>>>>> the same shared_ptr copy semantics by copying the underlying
>> ArrayData
>>>>> object.
>>>>>
>>>
>>
> 

Reply via email to