I'm fairly sure uintptr_t is an unsigned integer type for holding a pointer in
C++ (the docs specifically say a pointer "to void", aka `void*`). It should be
the same width as uint64_t on mainstream 64-bit systems, but where I agree it
is risky is that it is platform dependent, and there are likely nuances for
certain compilers or alternate libc implementations (e.g. Alpine). If what you
mean is that it won't round-trip across memory spaces, then sure, I agree, but
I am doubtful a naive shared_ptr would in that case either.
If I am wrong about the above then please correct me. To quickly sanity check 
myself, it seems that pointer provenance mostly points to the scenario of doing 
arithmetic on addresses, not whether an address value can be type cast to an 
integer and back again. I am *not* recommending type casting a pointer to an 
integer then doing math with it, then casting it back.
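
To make the cast-and-cast-back case concrete, here is a minimal sketch of what
I have in mind. `Product`, `to_integer`, `from_integer` and `round_trip_brand`
are hypothetical names for illustration only. It assumes the platform provides
uintptr_t at all (the type is optional in the standard), that it is no wider
than 64 bits, and that the object is still alive and in the same process when
the integer is converted back. No arithmetic is done on the integer.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>

// Hypothetical payload type, used only for illustration.
struct Product {
    std::string brand;
};

// Assumption: the pointer-sized integer fits in a 64-bit column value.
// This holds on mainstream 64-bit targets but is not guaranteed everywhere.
static_assert(sizeof(std::uintptr_t) <= sizeof(std::uint64_t),
              "uintptr_t is wider than 64 bits on this platform");

// Convert an object pointer to an integer; the value is never modified.
inline std::uintptr_t to_integer(const Product* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

// Convert the unmodified integer back to a pointer of the original type.
inline const Product* from_integer(std::uintptr_t bits) {
    return reinterpret_cast<const Product*>(bits);
}

// Round-trip a pointer while its owner is alive, then read through the result.
inline std::string round_trip_brand() {
    auto owner = std::make_shared<Product>(Product{"Samsung"});
    const std::uintptr_t bits = to_integer(owner.get());
    const Product* back = from_integer(bits);
    assert(back == owner.get());  // same pointer type, same value
    return back->brand;           // valid only because owner is still alive
}
```

The sketch says nothing about lifetime management: the integer does not keep
the object alive, so the owning shared_ptr has to outlive every read.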
On Thu, Oct 10, 2024 at 08:35, Jorge Cardoso Leitão
<jorgecarlei...@gmail.com> wrote:

Hi,

This use case seems semantically equivalent to storing Python objects in Arrow
for the purpose of putting them in an Arrow table. This can be achieved by some
form of pickling or indirection (I recall Polars and others doing one of
these).

Imo there are different approaches with different tradeoffs:

1. Serialize the objects to an Arrow struct data type. This lets you leverage
   both Arrow kernels and data locality (i.e. no indirection). It requires a
   (deep) copy of the objects into Arrow and may require restructuring a lot of
   code or using more memory to transpose rows into columns.
2. Store the object serialized in a binary data type. Benefits from locality,
   does not benefit from Arrow compute kernels, requires a deep copy and most
   likely a deserialization of the object on every read.
3. Store the pointer to the object as data type binary. Does not benefit from
   locality nor Arrow kernels. Does not require a deep copy of the data nor the
   serialization/deserialization cost of the data. Requires deserialization of
   the pointer itself per read.
4. Build a vector of pointers, and store the offset as integers in Arrow. Does
   not benefit from locality (double indirection), does not benefit from Arrow
   kernels, no deserialization cost per read.
 NOTE: pointers generally do not round-trip to integers - "cast pointer to 
integer back to pointer" is generally undefined behavior (in C/C++ or Rust), 
see pointer provenance.
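
As an illustration of the "deserialization of the pointer itself per read" in
approach 3, here is a minimal sketch that copies a pointer's bytes into a
fixed-width cell and back with memcpy. It uses plain std containers as a
stand-in for an Arrow fixed-size binary column; `Product`, `Cell`,
`encode_pointer`, `decode_pointer` and `sum_ids_via_cells` are hypothetical
names. It assumes the objects outlive the column and are read in the same
process that wrote them.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical payload type, used only for illustration.
struct Product {
    int id;
};

// One fixed-width binary cell, wide enough for one object pointer.
constexpr std::size_t kCellWidth = sizeof(Product*);
using Cell = std::array<unsigned char, kCellWidth>;

// Copy the pointer's object representation into a cell ("serialize").
inline Cell encode_pointer(Product* p) {
    Cell cell{};
    std::memcpy(cell.data(), &p, kCellWidth);
    return cell;
}

// Copy the bytes back into a pointer of the same type ("deserialize").
inline Product* decode_pointer(const Cell& cell) {
    Product* p = nullptr;
    std::memcpy(&p, cell.data(), kCellWidth);
    return p;
}

// Build a column of cells over live objects, then read every row through it.
inline int sum_ids_via_cells(std::vector<Product>& products) {
    std::vector<Cell> column;  // stand-in for a fixed-size binary column
    for (Product& p : products) {
        column.push_back(encode_pointer(&p));
    }
    int total = 0;
    for (const Cell& cell : column) {
        total += decode_pointer(cell)->id;  // one decode per read
    }
    return total;
}
```

The cells carry no ownership, so nothing here extends the lifetime of the
objects they point to.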
 Best,
 Jorge
 
On Thu, Oct 10, 2024, 07:58 Weston Pace <weston.p...@gmail.com> wrote:

If your goal is to use Arrow to do the computation then having shared pointers
will not help.  Arrow's computation kernels (filters, selection, etc.) are
designed to be fast because they run on columns of data.  If you have a
collection of objects (rows) then there isn't going to be anything in Arrow to
help you compute on this any better than using std::vector.

Note that the STL has a variety of APIs for filtering and computing on
std::vector (though maybe Arrow has a friendlier API :)

On the question of specifically storing shared_ptr you will have a problem.
You can store the raw pointers (reinterpret as integers or use fixed size
binary) but when the arrow array is deleted then the shared_ptr control
structure (the atomic reference counter) will not be decremented.  Arrow has no
concept of a per-value destructor.
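
A minimal sketch of why the reference count is a problem, using a plain
std::vector of integers as a stand-in for an Arrow integer column. `Product`,
`use_count_after_storing_raw` and `column_leaves_object_unowned` are
hypothetical names for illustration. It only shows that a column of raw
addresses takes no part in shared_ptr reference counting; it is not a statement
about any specific Arrow API.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical payload type, used only for illustration.
struct Product {
    int id;
};

// Store the raw address in an integer column and report the owner count.
// The count is unchanged because the column holds an address, not a shared_ptr.
inline long use_count_after_storing_raw(const std::shared_ptr<Product>& owner) {
    std::vector<std::uintptr_t> column;  // stand-in for an integer column
    column.push_back(reinterpret_cast<std::uintptr_t>(owner.get()));
    return owner.use_count();
}

// Drop the last shared_ptr while the column still exists.
// Returns true when the object has been destroyed despite the stored address.
inline bool column_leaves_object_unowned() {
    auto owner = std::make_shared<Product>(Product{7});
    std::weak_ptr<Product> watcher = owner;  // observes without owning
    std::vector<std::uintptr_t> column{
        reinterpret_cast<std::uintptr_t>(owner.get())};
    owner.reset();             // last owner gone, object is destroyed
    return watcher.expired();  // the address in the column now dangles
}
```

The converse also holds: if a count were bumped manually to keep the object
alive, nothing would decrement it when the column goes away.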
             
I agree with the others that storing shared_ptr in an arrow array is not going
to be useful.
          
On Wed, Oct 9, 2024 at 4:33 PM Aldrin <octalene....@pm.me> wrote:

Hello!

I think the main goal you're trying to achieve is to use Arrow for processing
some product details (e.g. brand name) in a tabular format without storing the
entirety of product details in the table itself.

I would think that you could store all of the product details in Arrow without
too much overhead (when you first load it into memory), but I'll not dive into
details there since you want to avoid it.

As Andrew mentioned, you could use a column of vector positions instead of a
column of shared_ptr, then use the vector positions to access wherever you're
storing your shared pointers. This is similar to a foreign key to a different
table.
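
A minimal sketch of the foreign-key idea, using plain std::vector columns as a
stand-in for Arrow arrays so it stays self-contained. `Product`,
`ProductStore`, `positions_for_brand`, `names_for_brand` and the sample values
are all hypothetical; in practice the two columns would be Arrow arrays and
the filter would be whatever mechanism you already use on the table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical product type; the names here are illustrative only.
struct Product {
    std::string brand;
    std::string name;
};

// Side store that owns the objects. A row refers to a product by position.
using ProductStore = std::vector<std::shared_ptr<Product>>;

// Scan the brand column and collect the store positions of matching rows.
inline std::vector<std::uint32_t> positions_for_brand(
    const std::vector<std::string>& brand_column,
    const std::vector<std::uint32_t>& position_column,
    const std::string& wanted) {
    assert(brand_column.size() == position_column.size());
    std::vector<std::uint32_t> hits;
    for (std::size_t row = 0; row < brand_column.size(); ++row) {
        if (brand_column[row] == wanted) {
            hits.push_back(position_column[row]);
        }
    }
    return hits;
}

// Build a tiny store plus two plain columns, filter, then follow the keys.
inline std::vector<std::string> names_for_brand(const std::string& wanted) {
    const ProductStore store = {
        std::make_shared<Product>(Product{"Samsung", "phone-a"}),
        std::make_shared<Product>(Product{"Other", "laptop-b"}),
        std::make_shared<Product>(Product{"Samsung", "monitor-c"}),
    };
    std::vector<std::string> brand_column;
    std::vector<std::uint32_t> position_column;
    for (std::uint32_t i = 0; i < store.size(); ++i) {
        brand_column.push_back(store[i]->brand);
        position_column.push_back(i);
    }
    std::vector<std::string> names;
    for (std::uint32_t pos :
         positions_for_brand(brand_column, position_column, wanted)) {
        names.push_back(store.at(pos)->name);  // lookup through the side store
    }
    return names;
}
```

The store keeps ownership the whole time, so the table only ever holds plain
integers and the shared_ptr reference counts behave as usual.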

An alternate, but delicate (aka real risky), approach could be to store the raw
pointer as a column of type uintptr_t (which you might approximate with a
uint64_t). There may not be much benefit compared to the foreign key approach,
since you'd have to iterate over the column values and do a type cast in order
to dereference the pointer, but it may reduce the hit of an indirect lookup
depending on how you're storing your shared pointers.

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

On Wednesday, October 9th, 2024 at 14:12, Andrew Bell
<andrew.bell...@gmail.com> wrote:

> You could give each product an ID number and use that as a proxy.
>
> On Wed, Oct 9, 2024 at 5:01 PM Yi Cao cao.yi.s...@gmail.com wrote:
>
> > Let's take a simple example. No network connection is involved. Say I can
> > have an array table of digital products, which has one column of shared_ptr
> > pointing to a product object allocated on heap. I would like to do
> > filtering on the column "brand" using the value "Samsung". Therefore I can
> > get all rows of "Samsung" products and by accessing the column of shared
> > pointer, I can access details of this product. Without using a shared
> > pointer, I would have to copy the product details into multiple columns of
> > this table. If I save all these shared pointers in a separate vector, then
> > I cannot do filtering like that in the arrow table.
> >
> > The challenge for me is how to store a shared_ptr in a "cell" of an arrow
> > table. It seems to me only the primitive types are supported, but I would
> > like to confirm. I think the "extension" type might help with my scenario
> > but I'm not sure how to make it work. If it's a simple type like integer, I
> > can do IntBuilder to build an array and make a record batch out of it.
> >
> > Hope this provides a bit of clarity. Thank you.
> >
> > On Wed, 9 Oct 2024 at 19:12, Andrew Bell andrew.bell...@gmail.com wrote:
> >
> > > On Wed, Oct 9, 2024, 12:27 PM Yi Cao cao.yi.s...@gmail.com wrote:
> > >
> > > > If I place these shared ptrs in a vector, how can I make this vector
> > > > saved in Arrow table as a column? Is it possible?
> > >
> > > What do you mean by "saved"?
> > >
> > > I don't understand the point of placing shared pointers in an arrow
> > > array. It's essentially equivalent to storing the pointers in a vector.
> > > You can't write shared pointers to a data store or send them across a
> > > network connection.
>
> --
> Andrew Bell
> andrew.bell...@gmail.com
