Hi,

This use case seems semantically equivalent to storing Python objects in Arrow for the purpose of putting them in an Arrow table. This can be achieved by some form of pickling or indirection (I recall Polars and others doing one of these).
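As a rough illustration of the indirection idea, here is a minimal sketch. It is not Arrow code: std::vector stands in for the Arrow columns to keep it dependency-free, and Product, ProductTable, BuildTable and FilterByBrand are made-up names. The objects stay in a side vector of shared_ptr and the table only stores each object's position.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical row object that stays on the heap, outside the table.
struct Product {
  std::string brand;
  std::string model;
};

// Stand-in for a two-column table. In real code `brand` and `product_index`
// would be Arrow arrays (e.g. a string column and an int64 column); plain
// std::vector is used here only as an assumption to keep the sketch small.
struct ProductTable {
  std::vector<std::string> brand;
  std::vector<std::int64_t> product_index;  // position in the side vector
};

// Builds the table from the side vector: the brand is copied into a column,
// while the object itself is referenced only by its position.
ProductTable BuildTable(const std::vector<std::shared_ptr<Product>>& products) {
  ProductTable table;
  for (std::size_t i = 0; i < products.size(); ++i) {
    table.brand.push_back(products[i]->brand);
    table.product_index.push_back(static_cast<std::int64_t>(i));
  }
  return table;
}

// Filters on the brand column and resolves the surviving positions back to
// the shared_ptrs. The side vector owns the objects, so reference counting
// is handled by shared_ptr as usual and no pointer is stored as an integer.
std::vector<std::shared_ptr<Product>> FilterByBrand(
    const ProductTable& table,
    const std::vector<std::shared_ptr<Product>>& products,
    const std::string& brand) {
  std::vector<std::shared_ptr<Product>> out;
  for (std::size_t row = 0; row < table.brand.size(); ++row) {
    if (table.brand[row] != brand) continue;
    const auto idx = static_cast<std::size_t>(table.product_index[row]);
    assert(idx < products.size());
    out.push_back(products[idx]);
  }
  return out;
}
```

The side vector has to outlive the table, and positions stay valid only as long as nothing is removed from or reordered in that vector.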
IMO there are different approaches with different tradeoffs:

1. Serialize the objects to an Arrow struct data type. This lets you leverage both Arrow kernels and data locality (i.e. no indirection). It requires a (deep) copy of the objects into Arrow, and may require restructuring a lot of code and using more memory for the transposition of rows to columns.
2. Store the object serialized in a binary data type. This benefits from locality but not from Arrow compute kernels. It requires a deep copy and most likely a deserialization of the object on every read.
3. Store the pointer to the object as data type binary. This benefits from neither locality nor Arrow kernels. It requires neither a deep copy of the data nor a serialization/deserialization of the data, but it does require deserializing the pointer itself on every read.
4. Build a vector of pointers, and store the offsets into that vector as integers in Arrow. This does not benefit from locality (double indirection) or from Arrow kernels, but has no deserialization cost per read.

NOTE: be careful when storing pointers as integers. "Cast pointer to integer and back to pointer" is implementation-defined in C/C++ and interacts with pointer provenance (in Rust as well), so it is easy to end up with undefined behavior; see pointer provenance.

Best,
Jorge

On Thu, Oct 10, 2024, 07:58 Weston Pace <weston.p...@gmail.com> wrote:

> If your goal is to use Arrow to do the computation then having shared
> pointers will not help. Arrow's computation kernels (filters, selection,
> etc.) are designed to be fast because they run on columns of data. If you
> have a collection of objects (rows) then there isn't going to be anything
> in Arrow to help you compute on this any better than using std::vector.
>
> Note that the STL has a variety of APIs for filtering and computing on
> std::vector (though maybe Arrow has a friendlier API :)
>
> On the question of specifically storing shared_ptr you will have a
> problem.
> You can store the raw pointers (reinterpret as integers or use
> fixed size binary) but when the arrow array is deleted then the shared_ptr
> control structure (the atomic reference counter) will not be decremented.
> Arrow has no concept of a per-value destructor.
>
> I agree with the others that storing shared_ptr in an arrow array is not
> going to be useful.
>
> On Wed, Oct 9, 2024 at 4:33 PM Aldrin <octalene....@pm.me> wrote:
>
>> Hello!
>>
>> I think the main goal you're trying to achieve is to use Arrow for
>> processing some product details (e.g. brand name) in a tabular format
>> without storing the entirety of product details in the table itself.
>>
>> I would think that you could store all of the product details in Arrow
>> without too much overhead (when you first load it into memory), but I'll
>> not dive into details there since you want to avoid it.
>>
>> As Andrew mentioned, you could use a column of vector positions instead
>> of a column of shared_ptr, then use the vector positions to access wherever
>> you're storing your shared pointers. This is similar to a foreign key to a
>> different table.
>>
>> An alternate, but delicate (aka real risky), approach could be to store
>> the raw pointer as a column of type uintptr_t (which you might approximate
>> with a uint64_t). There may not be much benefit compared to the foreign key
>> approach, since you'd have to iterate over the column values and do a type
>> cast in order to dereference the pointer, but it may reduce the hit of an
>> indirect lookup depending on how you're storing your shared pointers.
>>
>> # ------------------------------
>> # Aldrin
>>
>> https://github.com/drin/
>> https://gitlab.com/octalene
>> https://keybase.io/octalene
>>
>> On Wednesday, October 9th, 2024 at 14:12, Andrew Bell <
>> andrew.bell...@gmail.com> wrote:
>>
>> > You could give each product an ID number and use that as a proxy.
>> >
>> > On Wed, Oct 9, 2024 at 5:01 PM Yi Cao cao.yi.s...@gmail.com wrote:
>> >
>> > > Let's take a simple example. No network connection is involved. Say I
>> > > can have an Arrow table of digital products, which has one column of
>> > > shared_ptr pointing to a product object allocated on heap. I would like to
>> > > do filtering on the column "brand" using the value "Samsung". Therefore I
>> > > can get all rows of "Samsung" products and by accessing the column of
>> > > shared pointers, I can access details of this product. Without using a
>> > > shared pointer, I would have to copy the product details into multiple
>> > > columns of this table. If I save all these shared pointers in a separate
>> > > vector, then I cannot do filtering like that in the arrow table.
>> > >
>> > > The challenge for me is how to store a shared_ptr in a "cell" of an
>> > > arrow table. It seems to me only the primitive types are supported, but I
>> > > would like to confirm. I think the "extension" type might help with my
>> > > scenario but I'm not sure how to make it work. If it's a simple type like
>> > > integer, I can do IntBuilder to build an array and make a record batch out
>> > > of it.
>> > >
>> > > Hope this provides a bit of clarity. Thank you.
>> > >
>> > > On Wed, 9 Oct 2024 at 19:12, Andrew Bell andrew.bell...@gmail.com
>> > > wrote:
>> > >
>> > > > On Wed, Oct 9, 2024, 12:27 PM Yi Cao cao.yi.s...@gmail.com wrote:
>> > > >
>> > > > > If I place these shared ptrs in a vector, how can I make this
>> > > > > vector saved in Arrow table as a column? Is it possible?
>> > > >
>> > > > What do you mean by "saved"?
>> > > >
>> > > > I don't understand the point of placing shared pointers in an arrow
>> > > > array. It's essentially equivalent to storing the pointers in a vector. You
>> > > > can't write shared pointers to a data store or send them across a network
>> > > > connection.
>> >
>> > --
>> > Andrew Bell
>> > andrew.bell...@gmail.com