Re: [DISCUSS][C++] Store C++ shared_ptr in arrow table

Yi Cao Fri, 11 Oct 2024 13:23:34 -0700

First of all, thank you so much for your input and great insights!
An integer-pointer round trip does not seem reliable to me. We
experienced subtle UB in some cases before, which is one of the reasons we
are looking at Arrow.
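
For readers following along, the round trip under discussion looks roughly like the sketch below. The helper names are hypothetical and nothing here is Arrow API. It only shows the mechanics of packing a pointer into a 64-bit value and back, and it does not settle the provenance question raised further down the thread.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// uintptr_t is an optional typedef and need not have the same width as
// uint64_t, so the width assumption is checked at compile time.
static_assert(sizeof(std::uintptr_t) <= sizeof(std::uint64_t),
              "pointer-sized integers must fit in a 64-bit column value");

// Hypothetical helper: converts a pointer to the integer that would be
// stored in a uint64 column.
template <typename T>
std::uint64_t PointerToColumnValue(T* p) {
  return static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
}

// Hypothetical helper: converts a stored column value back to the same
// pointer type it was created from.
template <typename T>
T* ColumnValueToPointer(std::uint64_t v) {
  // The value must fit in uintptr_t (relevant on 32-bit targets).
  assert(v <= std::numeric_limits<std::uintptr_t>::max());
  return reinterpret_cast<T*>(static_cast<std::uintptr_t>(v));
}
```

Calling `ColumnValueToPointer<int>(PointerToColumnValue(&x))` is the whole round trip. The object must still be alive when the value is read back, and the integer carries no ownership.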


Regarding Jorge's 4 options, options 1-3 are not being considered due to
(de)serialization and deep-copy overhead in our case, but option 4 is an
interesting one. I have written test code and confirmed that compute::Filter
works well on such a table (not on that vector-index column), and likely
so do a lot more operations. So I think this is a performance benefit compared
to other filtering approaches, no? To access those objects on the heap, yes,
there will be overhead to dereference the shared_ptr, and that's the price I
think we need to pay.
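
A minimal sketch of that flow in plain C++, with std::vector standing in for the Arrow columns. The type and function names (Product, ProductColumns, FilterByBrand, Resolve) are assumptions made up for illustration. In the real setup the filter step would be compute::Filter on the table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical payload type; stands in for the heap-allocated object.
struct Product {
  std::string brand;
  std::string details;
};

// Plain vectors stand in for Arrow columns (assumption: the real table has a
// string "brand" column and an unsigned integer "index" column).
struct ProductColumns {
  std::vector<std::string> brand;
  std::vector<std::uint64_t> index;  // position into the side vector
};

// Returns the index-column values of the rows whose brand matches, i.e. what
// a filter on the brand column would leave behind in the index column.
std::vector<std::uint64_t> FilterByBrand(const ProductColumns& cols,
                                         const std::string& wanted) {
  assert(cols.brand.size() == cols.index.size());
  std::vector<std::uint64_t> selected;
  for (std::size_t row = 0; row < cols.brand.size(); ++row) {
    if (cols.brand[row] == wanted) selected.push_back(cols.index[row]);
  }
  return selected;
}

// Resolves filtered index values to the owning pointers. This is the extra
// indirection mentioned above; at() makes a stale index throw.
std::vector<std::shared_ptr<Product>> Resolve(
    const std::vector<std::shared_ptr<Product>>& objects,
    const std::vector<std::uint64_t>& selected) {
  std::vector<std::shared_ptr<Product>> out;
  out.reserve(selected.size());
  for (std::uint64_t i : selected) out.push_back(objects.at(i));
  return out;
}
```

The filter touches only the columnar data. The shared_ptr dereference happens afterwards, once per surviving row.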

A bit more detail: I experimented with "my own" MyRecordBatch, a copy of
class SimpleRecordBatch from record_batch.cc (I know it's not intended to be
used this way). In MyRecordBatch, I added a class member, a
vector<shared_ptr<T>> that keeps the large data valid, and an "Index" column as
part of the schema for accessing each object T. I think in most cases Arrow
operations can work on MyRecordBatch. Is there any risk in doing this?
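
One risk that can be sketched without Arrow: any batch derived from the original (filtered, sliced) still carries index values that are only meaningful against the original vector. The sketch below is a hypothetical stand-in, not the MyRecordBatch code. It assumes the side vector is shared and immutable, so derived batches cannot outlive or reorder it. Whether Arrow operations would propagate such an extra member to their results is an open question here and would need checking against the actual classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for the batch described above: an "index" column plus
// the side vector it points into. Names are illustrative, not Arrow API.
template <typename T>
struct IndexedBatch {
  using Objects = std::vector<std::shared_ptr<T>>;

  std::shared_ptr<const Objects> objects;  // shared, immutable side vector
  std::vector<std::uint64_t> index;        // stands in for the index column

  // Keeps only rows where keep[row] is true. The result shares `objects`, so
  // the surviving index values still refer to the same elements.
  IndexedBatch Filter(const std::vector<bool>& keep) const {
    assert(keep.size() == index.size());
    IndexedBatch out{objects, {}};
    for (std::size_t row = 0; row < index.size(); ++row) {
      if (keep[row]) out.index.push_back(index[row]);
    }
    return out;
  }

  // Bounds-checked access to the object behind a row.
  const std::shared_ptr<T>& At(std::size_t row) const {
    return objects->at(index.at(row));
  }
};
```

If the side vector were instead copied, reordered, or dropped by an operation that returns a plain batch, the index column would silently point at the wrong objects or at nothing.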

Thank you!


On Fri, 11 Oct 2024 at 08:45, Jorge Cardoso Leitão <jorgecarlei...@gmail.com>
wrote:

> AFAIK uintptr_t being internally stored as an integer does not make it
> equivalent to uint64_t - compilers use the type to set them apart, see the
> example in [1]. ptr2int2ptr can result in UB in subtle ways, due to how
> C/C++ are specified and translated to LLVM IR.
>
> Storing pointers as arrow integers and then casting them back to pointers
> to read their pointees seems to fit this ptr2int2ptr pattern. My point being:
> take this aspect into account when converting pointers to arrow and back.
>
> [1] https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html
>
> On Fri, Oct 11, 2024, 04:12 Aldrin <octalene....@pm.me> wrote:
>
>> I'm fairly sure uintptr_t is an integer type for holding a pointer in C++
>> (docs specifically say "to void" aka `void*`). It should be equivalent to
>> uint64_t on 64-bit systems, but where I agree it is risky is that it is
>> going to be platform dependent and there are likely nuances for certain
>> compilers or alternate libc implementations (e.g. alpine).
>>
>> If what you mean is that it won't roundtrip across memory spaces, then
>> sure I agree, but I am doubtful a naive shared_ptr would in that case
>> either.
>>
>> If I am wrong about the above then please correct me. To quickly sanity
>> check myself, it seems that pointer provenance mostly points to the
>> scenario of doing arithmetic on addresses, not whether an address value can
>> be type cast to an integer and back again. I am *not* recommending type
>> casting a pointer to an integer then doing math with it, then casting it
>> back.
>>
>>
>> On Thu, Oct 10, 2024 at 08:35, Jorge Cardoso Leitão
>> <jorgecarlei...@gmail.com> wrote:
>>
>> Hi,
>>
>> This use case seems semantically equivalent to storing Python objects
>> in Arrow for the purpose of putting them in an Arrow table. This can be
>> achieved by some form of pickling or indirection (I recall Polars and
>> others doing one of these).
>>
>> Imo there are different approaches with different tradeoffs:
>>
>> 1. Serialize the objects to an arrow struct data type. Allows leveraging
>> both arrow kernels and data locality (i.e. no indirection). Requires a
>> (deep) copy of the objects into arrow and may require restructuring a lot
>> of code / using more memory for the transposition of rows to columns.
>>
>> 2. Store the object serialized in a binary data type. Benefits from
>> locality, does not benefit from arrow compute kernels, requires deep copy
>> and most likely a deserialization of the object on every read.
>>
>> 3. Store the pointer to the object as data type binary. Does not benefit
>> from locality nor arrow kernels. Does not require a deep copy of the data
>> nor serialization/deserialization cost of the data. Requires
>> deserialization of the pointer itself per read.
>>
>> 4. Build a vector of pointers, and store the offset as integers in arrow.
>> Does not benefit from locality (double indirection), does not benefit from
>> arrow kernels, no deserialization cost per read.
>>
>> NOTE: pointers generally do not round-trip to integers - "cast pointer to
>> integer back to pointer" is generally undefined behavior (in C/C++ or
>> Rust), see pointer provenance.
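
As an illustration of option 3 above, here is a minimal sketch of storing a pointer as a fixed-size run of bytes and decoding it on each read. PointerBytes, EncodePointer and DecodePointer are hypothetical names, and a std::array stands in for one value of a fixed-size binary column. The memcpy on the read side is the "deserialization of the pointer itself" mentioned in that option.

```cpp
#include <array>
#include <cassert>
#include <cstring>

// One stored value: the object representation of a pointer.
using PointerBytes = std::array<unsigned char, sizeof(void*)>;

// Copies the pointer's bytes out, as they would be written into a binary cell.
template <typename T>
PointerBytes EncodePointer(T* p) {
  static_assert(sizeof(T*) == sizeof(void*), "unexpected object pointer size");
  PointerBytes bytes{};
  std::memcpy(bytes.data(), &p, sizeof(p));
  return bytes;
}

// Copies the bytes back into a pointer object of the same type (per read).
template <typename T>
T* DecodePointer(const PointerBytes& bytes) {
  static_assert(sizeof(T*) == sizeof(void*), "unexpected object pointer size");
  T* p = nullptr;
  std::memcpy(&p, bytes.data(), sizeof(p));
  return p;
}
```

The bytes are only meaningful inside the process that produced them and only while the pointee is alive.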
>>
>> Best,
>>
>> Jorge
>>
>> On Thu, Oct 10, 2024, 07:58 Weston Pace < weston.p...@gmail.com> wrote:
>>
>>> If your goal is to use Arrow to do the computation then having shared
>>> pointers will not help.  Arrow's computation kernels (filters, selection,
>>> etc.) are designed to be fast because they run on columns of data.  If you
>>> have a collection of objects (rows) then there isn't going to be anything
>>> in Arrow to help you compute on this any better than using std::vector.
>>>
>>> Note that the STL has a variety of APIs for filtering and computing on
>>> std::vector (though maybe Arrow has a friendlier API :)
>>>
>>> On the question of specifically storing shared_ptr you will have a
>>> problem.  You can store the raw pointers (reinterpret as integers or use
>>> fixed size binary) but when the arrow array is deleted then the shared_ptr
>>> control structure (the atomic reference counter) will not be decremented.
>>> Arrow has no concept of a per-value destructor.
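
The ownership point can be seen in plain C++ without Arrow. In this sketch a std::vector of raw pointers stands in for a column of raw addresses. Product and RawColumn are hypothetical names. Copying the addresses does not touch the reference counts, so the column neither keeps the objects alive nor releases them when it goes away.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical payload type.
struct Product {
  int id;
};

// Copies only the raw addresses out of the owning pointers, as a column of
// raw pointers would. No reference count changes here.
std::vector<const Product*> RawColumn(
    const std::vector<std::shared_ptr<Product>>& owners) {
  std::vector<const Product*> raw;
  raw.reserve(owners.size());
  for (const auto& owner : owners) raw.push_back(owner.get());
  return raw;
}
```

After the last owning shared_ptr is destroyed, every address in such a column dangles, and nothing in the column can detect that.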
>>>
>>> I agree with the others that storing shared_ptr in an arrow array is not
>>> going to be useful.
>>>
>>> On Wed, Oct 9, 2024 at 4:33 PM Aldrin < octalene....@pm.me> wrote:
>>>
>>>> Hello!
>>>>
>>>> I think the main goal you're trying to achieve is to use Arrow for
>>>> processing some product details (e.g. brand name) in a tabular format
>>>> without storing the entirety of product details in the table itself.
>>>>
>>>> I would think that you could store all of the product details in Arrow
>>>> without too much overhead (when you first load it into memory), but I'll
>>>> not dive into details there since you want to avoid it.
>>>>
>>>> As Andrew mentioned, you could use a column of vector positions instead
>>>> of a column of shared_ptr, then use the vector positions to access wherever
>>>> you're storing your shared pointers. This is similar to a foreign key to a
>>>> different table.
>>>>
>>>> An alternate, but delicate (aka real risky), approach could be to store
>>>> the raw pointer as a column of type uintptr_t (which you might approximate
>>>> with a uint64_t). There may not be much benefit compared to the foreign key
>>>> approach, since you'd have to iterate over the column values and do a type
>>>> cast in order to dereference the pointer, but it may reduce the hit of an
>>>> indirect lookup depending on how you're storing your shared pointers.
>>>>
>>>>
>>>>
>>>>
>>>> # ------------------------------
>>>>
>>>> # Aldrin
>>>>
>>>>
>>>> https://github.com/drin/
>>>>
>>>> https://gitlab.com/octalene
>>>>
>>>> https://keybase.io/octalene
>>>>
>>>>
>>>> On Wednesday, October 9th, 2024 at 14:12, Andrew Bell <
>>>> andrew.bell...@gmail.com> wrote:
>>>>
>>>> > You could give each product an ID number and use that as a proxy.
>>>> >
>>>>
>>>> > On Wed, Oct 9, 2024 at 5:01 PM Yi Cao cao.yi.s...@gmail.com wrote:
>>>> >
>>>>
>>>> > > Let's take a simple example. No network connection is involved. Say
>>>> > > I can have an array table of digital products, which has one column
>>>> > > of shared_ptr pointing to a product object allocated on heap. I would
>>>> > > like to do filtering on the column "brand" using the value "Samsung".
>>>> > > Therefore I can get all rows of "Samsung" products and by accessing
>>>> > > the column of shared pointer, I can access details of this product.
>>>> > > Without using a shared pointer, I would have to copy the product
>>>> > > details into multiple columns of this table. If I save all these
>>>> > > shared pointers in a separate vector, then I cannot do filtering like
>>>> > > that in the arrow table.
>>>> > >
>>>>
>>>> > > The challenge for me is how to store a shared_ptr in a "cell" of an
>>>> > > arrow table. It seems to me only the primitive types are supported,
>>>> > > but I would like to confirm. I think the "extension" type might help
>>>> > > with my scenario but I'm not sure how to make it work. If it's a
>>>> > > simple type like integer, I can do IntBuilder to build an array and
>>>> > > make a record batch out of it.
>>>> > >
>>>>
>>>> > > Hope this provides a bit of clarity. Thank you.
>>>> > >
>>>>
>>>> > > On Wed, 9 Oct 2024 at 19:12, Andrew Bell andrew.bell...@gmail.com wrote:
>>>> > >
>>>>
>>>> > > > On Wed, Oct 9, 2024, 12:27 PM Yi Cao cao.yi.s...@gmail.com wrote:
>>>> > > >
>>>>
>>>> > > > > If I place these shared ptrs in a vector, how can I make this
>>>> > > > > vector saved in Arrow table as a column? Is it possible?
>>>> > > >
>>>>
>>>> > > > What do you mean by "saved"?
>>>> > > >
>>>>
>>>> > > > I don't understand the point of placing shared pointers in an
>>>> > > > arrow array. It's essentially equivalent to storing the pointers in
>>>> > > > a vector. You can't write shared pointers to a data store or send
>>>> > > > them across a network connection.
>>>> >
>>>>
>>>> >
>>>>
>>>> >
>>>>
>>>> >
>>>>
>>>> > --
>>>> > Andrew Bell
>>>> > andrew.bell...@gmail.com
>>>
>>>
