Re: [DISCUSS][C++] Store C++ shared_ptr in arrow table

Yi Cao Wed, 30 Oct 2024 21:52:52 -0700

Hi Jorge,

I think your earlier comment, quoted below, could solve my issue. Could you
please give more detail on how the same thing can be achieved for C++
objects (shared_ptr) stored in an Arrow table?


Thanks a lot in advance.

“This use-case seems semantically equivalent with storing python objects in
arrow for the purpose of putting them in an arrow table. This can be
achieved by some form of pickling or indirection (I recall Polars and
others doing one of these).”

Yi

On Fri, 11 Oct 2024 at 21:23, Yi Cao <cao.yi.s...@gmail.com> wrote:

> First of all, thank you so much for your input and great insights!
> The integer-pointer round trip does not seem reliable to me. We
> experienced subtle UB in some cases before, which is one of the reasons we
> are looking at Arrow.
>
> Regarding Jorge's 4 options, options 1-3 are ruled out in our case because
> of the (de)serialization and deep-copy overhead, but option 4 is an
> interesting one. I have written test code showing that compute::Filter
> works well on such a table (filtering on the other columns, not on the
> vector-index column), and likely many more operations do too. So I think
> this is a performance benefit compared to other filtering approaches, no?
> Accessing the objects on the heap does add the overhead of dereferencing a
> shared_ptr, and that's the price I think we need to pay.
>
> A bit more detail - I experimented with "my own" MyRecordBatch, a copy
> of class SimpleRecordBatch from record_batch.cc (I know it's not intended
> to be used this way). MyRecordBatch adds a class member, a
> vector<shared_ptr<T>> that keeps the large data valid, and an "Index"
> column in the schema for accessing each object T. I think in most cases
> Arrow operations can work on MyRecordBatch. Is there any risk in doing this?
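
One way to picture the arrangement described above is composition: a small wrapper that pairs the index column with shared ownership of the object vector. This is only a sketch under stated assumptions. `Payload`, `Store`, and `IndexedView` are hypothetical names, and a plain `std::vector<std::int64_t>` stands in for the Arrow "Index" column; no Arrow API is used. It shows the lifetime property that matters here: a filtered view shares the store, so its index values stay valid even after the original handle to the store is dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical payload type (the "T" of the message above).
struct Payload {
  int value;
};

using Store = std::vector<std::shared_ptr<Payload>>;

// Pairs an index column (stand-in for an Arrow integer column) with shared
// ownership of the store it points into. Filtered views share the same
// store, so their index values stay valid while any view is alive.
class IndexedView {
 public:
  IndexedView(std::shared_ptr<const Store> store,
              std::vector<std::int64_t> index)
      : store_(std::move(store)), index_(std::move(index)) {}

  std::size_t num_rows() const { return index_.size(); }

  // Resolve a row to its object through the index column.
  const std::shared_ptr<Payload>& at(std::size_t row) const {
    const std::int64_t pos = index_.at(row);
    assert(pos >= 0 && static_cast<std::size_t>(pos) < store_->size());
    return (*store_)[static_cast<std::size_t>(pos)];
  }

  // Keep only rows whose mask entry is true. The store is shared, not copied.
  IndexedView Filter(const std::vector<bool>& mask) const {
    assert(mask.size() == index_.size());
    std::vector<std::int64_t> kept;
    for (std::size_t row = 0; row < mask.size(); ++row) {
      if (mask[row]) kept.push_back(index_[row]);
    }
    return IndexedView(store_, std::move(kept));
  }

 private:
  std::shared_ptr<const Store> store_;
  std::vector<std::int64_t> index_;
};
```

The store is held as `const` so that no view can reorder or shrink it, which would silently invalidate the positions held by other views.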
>
> Thank you!
>
>
> On Fri, 11 Oct 2024 at 08:45, Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
>> AFAIK uintptr_t being internally stored as an integer does not make it
>> equivalent to uint64_t - compilers use the type to set them apart, see the
>> example in [1]. ptr2int2ptr can result in UB in subtle ways, due to how
>> C/C++ are specified and translated to LLVM IR.
>>
>> Storing pointers as arrow integers and then casting them back to
>> pointers to read the pointee seems to fit this ptr2int2ptr pattern. My point
>> being: take this aspect into account when converting pointers to arrow and
>> back.
>>
>> [1] https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html
>>
>> On Fri, Oct 11, 2024, 04:12 Aldrin <octalene....@pm.me> wrote:
>>
>>> I'm fairly sure uintptr_t is an integer type for holding a pointer in
>>> C++ (docs specifically say "to void" aka `void*`). It should be equivalent
>>> to uint64_t on 64-bit systems, but where I agree it is risky is that it is
>>> going to be platform dependent and there are likely nuances for certain
>>> compilers or alternate libc implementations (e.g. alpine).
>>>
>>> If what you mean is that it won't roundtrip across memory spaces, then
>>> sure I agree, but I am doubtful a naive shared_ptr would in that case
>>> either.
>>>
>>> If I am wrong about the above then please correct me. To quickly sanity
>>> check myself, it seems that pointer provenance mostly points to the
>>> scenario of doing arithmetic on addresses, not whether an address value can
>>> be type cast to an integer and back again. I am *not* recommending type
>>> casting a pointer to an integer then doing math with it, then casting it
>>> back.
>>>
>>>
>>>
>>> On Thu, Oct 10, 2024 at 08:35, Jorge Cardoso Leitão <
>>> jorgecarlei...@gmail.com> wrote:
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>> This use-case seems semantically equivalent with storing python objects
>>> in arrow for the purpose of putting them in an arrow table. This can be
>>> achieved by some form of pickling or indirection (I recall Polars and
>>> others doing one of these).
>>>
>>>
>>> Imo there are different approaches with different tradeoffs:
>>>
>>>
>>> 1. Serialize the objects to an arrow struct data type. This lets you
>>> leverage both arrow kernels and data locality (i.e. no indirection).
>>> Requires a (deep) copy of the objects into arrow and may require
>>> restructuring a lot of code / using more memory for the transposition of
>>> rows to columns.
>>>
>>>
>>> 2. Store the object serialized in a binary data type. Benefits from
>>> locality, does not benefit from arrow compute kernels, requires deep copy
>>> and most likely a deserialization of the object on every read.
>>>
>>>
>>> 3. Store the pointer to the object as data type binary. Does not benefit
>>> from locality nor arrow kernels. Does not require a deep copy of the data
>>> nor serialization/deserialization cost of the data. Requires
>>> deserialization of the pointer itself per read.
>>>
>>>
>>> 4. Build a vector of pointers, and store the offset as integers in
>>> arrow. Does not benefit from locality (double indirection), does not
>>> benefit from arrow kernels, no deserialization cost per read.
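
Option 4 can be pictured with a small standard-library sketch. It assumes a hypothetical `Product` type, and plain `std::vector` columns stand in for the Arrow "brand" and index columns; none of these names come from Arrow or from this thread. The filter runs on the brand column, and the surviving index values are followed back into the vector that owns the objects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical row object that lives on the heap.
struct Product {
  std::string brand;
  std::string details;
};

// Side storage: owns the objects. The table only refers to positions in it.
using ProductStore = std::vector<std::shared_ptr<Product>>;

// Stand-ins for two Arrow columns of equal length: a "brand" string column
// and an integer "index" column holding positions into the ProductStore.
struct Columns {
  std::vector<std::string> brand;
  std::vector<std::int64_t> index;
};

// Build the columns from the store. Only the brand is copied; the rest of
// each object stays where it is.
Columns BuildColumns(const ProductStore& store) {
  Columns cols;
  for (std::size_t i = 0; i < store.size(); ++i) {
    cols.brand.push_back(store[i]->brand);
    cols.index.push_back(static_cast<std::int64_t>(i));
  }
  return cols;
}

// Filter on the brand column, then follow the surviving index values back
// into the store.
std::vector<std::shared_ptr<Product>> SelectByBrand(
    const Columns& cols, const ProductStore& store, const std::string& brand) {
  std::vector<std::shared_ptr<Product>> out;
  for (std::size_t row = 0; row < cols.brand.size(); ++row) {
    if (cols.brand[row] != brand) continue;
    const std::int64_t pos = cols.index[row];
    assert(pos >= 0 && static_cast<std::size_t>(pos) < store.size());
    out.push_back(store[static_cast<std::size_t>(pos)]);
  }
  return out;
}
```

Because the column holds positions rather than addresses, no pointer is ever converted to or from an integer; the trade-off is the extra lookup through the vector.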
>>>
>>>
>>> NOTE: pointers generally do not round-trip to integers - "cast pointer
>>> to integer back to pointer" is generally undefined behavior (in C/C++ or
>>> Rust), see pointer provenance.
>>>
>>>
>>> Best,
>>>
>>>
>>> Jorge
>>>
>>> On Thu, Oct 10, 2024, 07:58 Weston Pace <weston.p...@gmail.com> wrote:
>>>
>>>> If your goal is to use Arrow to do the computation then having shared
>>>> pointers will not help.  Arrow's computation kernels (filters, selection,
>>>> etc.) are designed to be fast because they run on columns of data.  If you
>>>> have a collection of objects (rows) then there isn't going to be anything
>>>> in Arrow to help you compute on this any better than using std::vector.
>>>>
>>>> Note that the STL has a variety of APIs for filtering and computing on
>>>> std::vector (though maybe Arrow has a friendlier API :)
>>>>
>>>> On the question of specifically storing shared_ptr you will have a
>>>> problem.  You can store the raw pointers (reinterpret as integers or use
>>>> fixed size binary) but when the arrow array is deleted then the shared_ptr
>>>> control structure (the atomic reference counter) will not be decremented.
>>>> Arrow has no concept of a per-value destructor.
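
The ownership problem described above can be seen without Arrow at all. In this sketch, which assumes a hypothetical `Product` type and uses a `std::vector<std::uintptr_t>` as a stand-in for an integer column, copying raw addresses into the column leaves the reference counts untouched. The objects are destroyed as soon as the last shared_ptr goes away, whether or not the column still exists.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical heap object.
struct Product {
  int id;
};

// Stand-in for an Arrow integer column that holds raw addresses.
using AddressColumn = std::vector<std::uintptr_t>;

// Copies the raw address of each object into the column. The shared_ptr
// reference counts are not touched: an integer is not an owner.
AddressColumn StoreAddresses(
    const std::vector<std::shared_ptr<Product>>& owners) {
  AddressColumn column;
  for (const auto& owner : owners) {
    assert(owner != nullptr);
    column.push_back(reinterpret_cast<std::uintptr_t>(owner.get()));
  }
  return column;
}

// Returns whether the object is still alive after its owners are dropped
// while the address column is kept around.
bool ObjectOutlivesOwners() {
  std::vector<std::shared_ptr<Product>> owners = {
      std::make_shared<Product>(Product{1})};
  std::weak_ptr<Product> watch = owners[0];  // observes without owning
  const AddressColumn column = StoreAddresses(owners);
  owners.clear();  // last owner gone; the Product is destroyed here
  // `column` still holds the old address, which must not be dereferenced.
  (void)column;
  return !watch.expired();
}
```

The reverse problem also exists: if owning shared_ptr values were somehow placed in the buffer, nothing would run their destructors when the buffer is freed, which is the per-value destructor gap mentioned above.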
>>>>
>>>> I agree with the others that storing shared_ptr in an arrow array is
>>>> not going to be useful.
>>>>
>>>> On Wed, Oct 9, 2024 at 4:33 PM Aldrin <octalene....@pm.me> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I think the main goal you're trying to achieve is to use Arrow for
>>>>> processing some product details (e.g. brand name) in a tabular format
>>>>> without storing the entirety of product details in the table itself.
>>>>>
>>>>> I would think that you could store all of the product details in Arrow
>>>>> without too much overhead (when you first load it into memory), but I'll
>>>>> not dive into details there since you want to avoid it.
>>>>>
>>>>> As Andrew mentioned, you could use a column of vector positions
>>>>> instead of a column of shared_ptr, then use the vector positions to access
>>>>> wherever you're storing your shared pointers. This is similar to a foreign
>>>>> key to a different table.
>>>>>
>>>>> An alternate, but delicate (aka real risky), approach could be to
>>>>> store the raw pointer as a column of type uintptr_t (which you might
>>>>> approximate with a uint64_t). There may not be much benefit compared to
>>>>> the foreign key approach, since you'd have to iterate over the column values
>>>>> and do a type cast in order to dereference the pointer, but it may reduce
>>>>> the hit of an indirect lookup depending on how you're storing your shared
>>>>> pointers.
>>>>>
>>>>> # ------------------------------
>>>>> # Aldrin
>>>>>
>>>>> https://github.com/drin/
>>>>> https://gitlab.com/octalene
>>>>> https://keybase.io/octalene
>>>>>
>>>>> On Wednesday, October 9th, 2024 at 14:12, Andrew Bell <
>>>>> andrew.bell...@gmail.com> wrote:
>>>>>
>>>>> > You could give each product an ID number and use that as a proxy.
>>>>> >
>>>>> > On Wed, Oct 9, 2024 at 5:01 PM Yi Cao cao.yi.s...@gmail.com wrote:
>>>>> >
>>>>> > > Let's take a simple example. No network connection is involved.
>>>>> > > Say I have an arrow table of digital products, which has one column
>>>>> > > of shared_ptr, each pointing to a product object allocated on the
>>>>> > > heap. I would like to filter on the column "brand" using the value
>>>>> > > "Samsung". That way I get all rows of "Samsung" products and, by
>>>>> > > accessing the column of shared pointers, I can access the details
>>>>> > > of each product. Without a shared pointer, I would have to copy the
>>>>> > > product details into multiple columns of this table. If I save all
>>>>> > > these shared pointers in a separate vector, then I cannot do
>>>>> > > filtering like that in the arrow table.
>>>>> > >
>>>>> > > The challenge for me is how to store a shared_ptr in a "cell" of
>>>>> > > an arrow table. It seems to me only primitive types are supported,
>>>>> > > but I would like to confirm. I think the "extension" type might help
>>>>> > > with my scenario but I'm not sure how to make it work. If it's a
>>>>> > > simple type like integer, I can use IntBuilder to build an array and
>>>>> > > make a record batch out of it.
>>>>> > >
>>>>> > > Hope this provides a bit of clarity. Thank you.
>>>>> > >
>>>>> > > On Wed, 9 Oct 2024 at 19:12, Andrew Bell andrew.bell...@gmail.com wrote:
>>>>> > >
>>>>> > > > On Wed, Oct 9, 2024, 12:27 PM Yi Cao cao.yi.s...@gmail.com wrote:
>>>>> > > >
>>>>> > > > > If I place these shared ptrs in a vector, how can I make this
>>>>> > > > > vector saved in Arrow table as a column? Is it possible?
>>>>> > > >
>>>>> > > > What do you mean by "saved"?
>>>>> > > >
>>>>> > > > I don't understand the point of placing shared pointers in an
>>>>> > > > arrow array. It's essentially equivalent to storing the pointers in a
>>>>> > > > vector. You can't write shared pointers to a data store or send them
>>>>> > > > across a network connection.
>>>>> >
>>>>> > --
>>>>> > Andrew Bell
>>>>> > andrew.bell...@gmail.com
>>>>>
