First of all, thank you so much for your input and great insights! An integer-pointer round trip does not seem reliable to me. We have experienced subtle UB in some cases before, which is one of the reasons we are looking at Arrow.

Regarding Jorge's 4 options: options 1-3 are not under consideration because of the (de)serialization and deep-copy overhead in our case, but option 4 is an interesting one. I have written test code and confirmed that compute::Filter works well on such a table (not on the vector-index column itself), and likely a lot more operations do too. So I think this is a performance benefit compared to other filtering approaches, no? To access the objects on the heap, yes, there will be overhead from dereferencing the shared_ptr, and that's the price I think we need to pay.

A bit more detail: I experimented with "my own" MyRecordBatch, a copy of class SimpleRecordBatch from record_batch.cc (I know it's not intended to be used this way). MyRecordBatch has an added class member, a vector<shared_ptr<T>> that keeps the large data valid, and an "Index" column as part of the schema for accessing the objects T. I think in most cases Arrow operations can work on MyRecordBatch. Is there any risk in doing this?

Thank you!

On Fri, 11 Oct 2024 at 08:45, Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

> AFAIK uintptr_t being internally stored as an integer does not make it
> equivalent to uint64_t - compilers use the type to set them apart, see the
> example in [1]. ptr2int2ptr can result in UB in subtle ways, due to how
> C/C++ are specified and translated to LLVM IR.
>
> Storing pointers as arrow integers and then casting them back to pointers
> to read its pointee, seems to fit this ptr2int2ptr. My point being: take
> this aspect into account when converting pointers to arrow and back.
>
> [1] https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html
>
> On Fri, Oct 11, 2024, 04:12 Aldrin <octalene....@pm.me> wrote:
>
>> I'm fairly sure uintptr_t is an integer type for holding a pointer in C++
>> (docs specifically say "to void" aka `void*`).
>> It should be equivalent to
>> uint64_t on 64-bit systems, but where I agree it is risky is that it is
>> going to be platform dependent and there are likely nuances for certain
>> compilers or alternate libc implementations (e.g. alpine).
>>
>> If what you mean is that it won't roundtrip across memory spaces, then
>> sure I agree, but I am doubtful a naive shared_ptr would in that case
>> either.
>>
>> If I am wrong about the above then please correct me. To quickly sanity
>> check myself, it seems that pointer provenance mostly points to the
>> scenario of doing arithmetic on addresses, not whether an address value can
>> be type cast to an integer and back again. I am *not* recommending type
>> casting a pointer to an integer then doing math with it, then casting it
>> back.
>>
>> Sent from Proton Mail <https://proton.me/mail/home> for iOS
>>
>>
>> On Thu, Oct 10, 2024 at 08:35, Jorge Cardoso Leitão
>> <jorgecarlei...@gmail.com> wrote:
>>
>> Hi,
>>
>> This use-case seems semantically equivalent with storing python objects
>> in arrow for the purpose of putting them in an arrow table. This can be
>> achieved by some form of pickling or indirection (I recall Polars and
>> others doing one of these).
>>
>> Imo there are different approaches with different tradeoffs:
>>
>> 1. Serialize the objects to an arrow struct data type. Allows to leverage
>> both arrow kernels and data locality (i.e. no indirection). Requires a
>> (deep) copy of the objects into arrow and may require restructuring a lot
>> of code / use more memory for the transposition of rows to columns.
>>
>> 2. Store the object serialized in a binary data type. Benefits from
>> locality, does not benefit from arrow compute kernels, requires deep copy
>> and most likely a deserialization of the object on every read.
>>
>> 3. Store the pointer to the object as data type binary. Does not benefit
>> from locality nor arrow kernels.
>> Does not require a deep copy of the data
>> nor serialization/deserialization cost of the data. Requires
>> deserialization of the pointer itself per read.
>>
>> 4. Build a vector of pointers, and store the offset as integers in arrow.
>> Does not benefit from locality (double indirection), does not benefit from
>> arrow kernels, no deserialization cost per read.
>>
>> NOTE: pointers generally do not round-trip to integers - "cast pointer to
>> integer back to pointer" is generally undefined behavior (in C/C++ or
>> Rust), see pointer provenance.
>>
>> Best,
>>
>> Jorge
>>
>> On Thu, Oct 10, 2024, 07:58 Weston Pace <weston.p...@gmail.com> wrote:
>>
>>> If your goal is to use Arrow to do the computation then having shared
>>> pointers will not help. Arrow's computation kernels (filters, selection,
>>> etc.) are designed to be fast because they run on columns of data. If you
>>> have a collection of objects (rows) then there isn't going to be anything
>>> in Arrow to help you compute on this any better than using std::vector.
>>>
>>> Note that the STL has a variety of APIs for filtering and computing on
>>> std::vector (though maybe Arrow has a friendlier API :)
>>>
>>> On the question of specifically storing shared_ptr you will have a
>>> problem. You can store the raw pointers (reinterpret as integers or use
>>> fixed size binary) but when the arrow array is deleted then the shared_ptr
>>> control structure (the atomic reference counter) will not be decremented.
>>> Arrow has no concept of a per-value destructor.
>>>
>>> I agree with the others that storing shared_ptr in an arrow array is not
>>> going to be useful.
>>>
>>> On Wed, Oct 9, 2024 at 4:33 PM Aldrin <octalene....@pm.me> wrote:
>>>
>>>> Hello!
>>>>
>>>> I think the main goal you're trying to achieve is to use Arrow for
>>>> processing some product details (e.g. brand name) in a tabular format
>>>> without storing the entirety of product details in the table itself.
>>>>
>>>> I would think that you could store all of the product details in Arrow
>>>> without too much overhead (when you first load it into memory), but I'll
>>>> not dive into details there since you want to avoid it.
>>>>
>>>> As Andrew mentioned, you could use a column of vector positions instead
>>>> of a column of shared_ptr, then use the vector positions to access wherever
>>>> you're storing your shared pointers. This is similar to a foreign key to a
>>>> different table.
>>>>
>>>> An alternate, but delicate (aka real risky), approach could be to store
>>>> the raw pointer as a column of type uintptr_t (which you might approximate
>>>> with a uint64_t). There may not be much benefit compared to the foreign key
>>>> approach, since you'd have to iterate over the column values and do a type
>>>> cast in order to dereference the pointer, but it may reduce the hit of an
>>>> indirect lookup depending on how you're storing your shared pointers.
>>>>
>>>> # ------------------------------
>>>> # Aldrin
>>>>
>>>> https://github.com/drin/
>>>> https://gitlab.com/octalene
>>>> https://keybase.io/octalene
>>>>
>>>> On Wednesday, October 9th, 2024 at 14:12, Andrew Bell
>>>> <andrew.bell...@gmail.com> wrote:
>>>>
>>>> > You could give each product an ID number and use that as a proxy.
>>>> >
>>>> > On Wed, Oct 9, 2024 at 5:01 PM Yi Cao cao.yi.s...@gmail.com wrote:
>>>> >
>>>> > > Let's take a simple example. No network connection is involved. Say
>>>> > > I can have an array table of digital products, which has one column
>>>> > > of shared_ptr pointing to a product object allocated on heap. I
>>>> > > would like to do filtering on the column "brand" using the value
>>>> > > "Samsung". Therefore I can get all rows of "Samsung" products and by
>>>> > > accessing the column of shared pointer, I can access details of this
>>>> > > product.
>>>> > > Without using a shared pointer, I would have to copy the product
>>>> > > details into multiple columns of this table. If I save all these
>>>> > > shared pointers in a separate vector, then I cannot do filtering
>>>> > > like that in the arrow table.
>>>> > >
>>>> > > The challenge for me is how to store a shared_ptr in a "cell" of an
>>>> > > arrow table. It seems to me only the primitive types are supported,
>>>> > > but I would like to confirm. I think the "extension" type might help
>>>> > > with my scenario but I'm not sure how to make it work. If it's a
>>>> > > simple type like integer, I can do IntBuilder to build an array and
>>>> > > make a record batch out of it.
>>>> > >
>>>> > > Hope this provides a bit of clarity. Thank you.
>>>> > >
>>>> > > On Wed, 9 Oct 2024 at 19:12, Andrew Bell andrew.bell...@gmail.com
>>>> > > wrote:
>>>> > >
>>>> > > > On Wed, Oct 9, 2024, 12:27 PM Yi Cao cao.yi.s...@gmail.com wrote:
>>>> > > >
>>>> > > > > If I place these shared ptrs in a vector, how can I make this
>>>> > > > > vector saved in Arrow table as a column? Is it possible?
>>>> > > >
>>>> > > > What do you mean by "saved"?
>>>> > > >
>>>> > > > I don't understand the point of placing shared pointers in an
>>>> > > > arrow array. It's essentially equivalent to storing the pointers
>>>> > > > in a vector. You can't write shared pointers to a data store or
>>>> > > > send them across a network connection.
>>>> >
>>>> > --
>>>> > Andrew Bell
>>>> > andrew.bell...@gmail.com
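For readers following the thread, here is a minimal sketch of the layout discussed above (Jorge's option 4, also suggested by Andrew and Aldrin): a filterable column, an integer index column, and a side vector of shared_ptr. It is only an illustration under stated assumptions. Plain std::vector stands in for the Arrow columns, no Arrow API is used, and the names Product, ProductTable, and filter_by_brand are hypothetical. It is not the MyRecordBatch code described in the top message.

```cpp
// Hypothetical sketch: "index column + side vector of shared_ptr".
// std::vector stands in for Arrow columns; this does not call Arrow.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical heavyweight object that should not be copied into columns.
struct Product {
    std::string brand;
    std::string details;
};

struct ProductTable {
    std::vector<std::string> brand;                  // column to filter on
    std::vector<std::int64_t> index;                 // position in `objects`
    std::vector<std::shared_ptr<Product>> objects;   // owns the references

    // Adds one row: copies the brand, records the object's position, and
    // stores the shared_ptr in the side vector.
    void append(std::shared_ptr<Product> p) {
        assert(p != nullptr);
        brand.push_back(p->brand);
        index.push_back(static_cast<std::int64_t>(objects.size()));
        objects.push_back(std::move(p));
    }
};

// Filters on the brand column and returns the surviving index values.
// The values refer to positions in the original `objects` vector.
std::vector<std::int64_t> filter_by_brand(const ProductTable& table,
                                          const std::string& wanted) {
    std::vector<std::int64_t> selected;
    for (std::size_t row = 0; row < table.brand.size(); ++row) {
        if (table.brand[row] == wanted) {
            selected.push_back(table.index[row]);
        }
    }
    return selected;
}
```

In this sketch the columns hold only plain integers, so the point Weston raised about Arrow having no per-value destructor does not come up: the side vector holds the references and releases them when it is destroyed. The flip side is that filtered index values stay meaningful only while that vector is alive and its elements are not reordered.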