On Sun, Jul 31, 2022 at 8:05 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hi Wes,
>
> On 31/07/2022 at 00:02, Wes McKinney wrote:
> >
> > I understand there are still some aspects of this project that cause
> > some squeamishness (like having arbitrary memory addresses embedded
> > within array values whose lifetime a C ABI consumer may not know about
> > -- we already export memory addresses in the C ABI but fewer of them
> > because they are only the buffers at the array level). We discussed
> > some alternative approaches that address some of these questions, but
> > each come with associated trade-offs.
>
> Are any of these trade-offs blocking?
>
They aren't blocking implementation work at least. I think the alternative designs / requirements that were discussed were

* Attaching all referenced memory buffers by pointers in the C ABI, or
* Using offsets into an attached buffer instead of pointers

I think that either of these poses conflicts with pooled allocators or tiered buffer management, since a single Arrow vector may reference many buffers within a memory pool (and different vectors may reference different memory chunks in the pool). Externalizing all referenced buffers is burdensome in the first case, and the second case would require an expensive "repack" operation, defeating the goal of zero copy.

You can see a discussion of how Umbra has three different storage tiers (persistent, transient, temporary) for out-of-line strings:

https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

It might be a good idea to look more carefully at how DuckDB and Velox do memory management for the out-of-line strings. If we start placing restrictions on how the out-of-line string buffers are managed and externalized, it risks undermining the zero-copy interoperability benefits that we're trying to achieve with this.
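
To make the trade-off concrete, here is a rough Python sketch of the two alternatives. It is only an illustration: `PooledView`, `OffsetView`, `referenced_chunks` and `repack` are hypothetical names, not part of Arrow, DuckDB, Velox or Umbra, and the pool is modeled as a plain dict of byte strings rather than real allocator memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PooledView:
    """Hypothetical out-of-line string view pointing into a pool chunk."""
    chunk_id: int  # stands in for a raw pointer into some pool chunk
    start: int
    length: int


@dataclass(frozen=True)
class OffsetView:
    """Hypothetical view expressed as an offset into one attached buffer."""
    offset: int
    length: int


def referenced_chunks(views):
    # Alternative 1: every chunk touched by the vector has to be attached.
    return sorted({v.chunk_id for v in views})


def repack(pool, views):
    # Alternative 2: copy every string into one contiguous buffer so that
    # views can be plain offsets. This copy is the "repack" cost.
    out = bytearray()
    packed = []
    for v in views:
        packed.append(OffsetView(len(out), v.length))
        out += pool[v.chunk_id][v.start:v.start + v.length]
    return bytes(out), packed


# Three chunks, loosely mirroring the three storage tiers mentioned above.
pool = {
    0: b"persistent-string-data",
    1: b"transient-string-data",
    2: b"temporary-string-data",
}
views = [PooledView(0, 0, 10), PooledView(2, 0, 9), PooledView(1, 0, 9)]

chunks = referenced_chunks(views)    # all three chunks must be exported
buffer, packed = repack(pool, views)  # or every string must be copied
```

In this toy case a three-element vector already touches every chunk in the pool, so the first alternative exports the whole pool, while the second copies each string once.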