On Sun, Jul 31, 2022 at 8:05 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hi Wes,
>
> Le 31/07/2022 à 00:02, Wes McKinney a écrit :
> >
> > I understand there are still some aspects of this project that cause
> > some squeamishness (like having arbitrary memory addresses embedded
> > within array values whose lifetime a C ABI consumer may not know about
> > -- we already export memory addresses in the C ABI but fewer of them
> > because they are only the buffers at the array level). We discussed
> > some alternative approaches that address some of these questions, but
> > each come with associated trade-offs.
>
> Are any of these trade-offs blocking?
>

They aren't blocking implementation work at least.

I think the alternative designs / requirements that were discussed were:

* Attaching all referenced memory buffers by pointer in the C ABI, or
* Using offsets into an attached buffer instead of pointers

I think that either of these conflicts with pooled allocators or
tiered buffer management. A single Arrow vector may reference many
buffers within a memory pool, and different vectors may reference
different memory chunks in the pool. Externalizing all referenced
buffers is burdensome in the first case, and the second case would
require an expensive "repack" operation, defeating the goal of zero
copy.
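
To make the trade-off concrete, here is a rough Python sketch of the
two alternatives. Everything in it is an illustrative assumption (the
View class, the dict standing in for a memory pool, the helper names);
none of it is Arrow or C ABI API. A chunk id stands in for the raw
pointer that a real string view would carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """Hypothetical out-of-line string view: a chunk id stands in for a raw pointer."""
    chunk_id: int
    offset: int
    length: int


def referenced_chunks(views):
    """Alternative 1: enumerate every pool chunk the vector points into,
    so that each one can be attached to the exported array."""
    return sorted({v.chunk_id for v in views})


def repack(views, pool):
    """Alternative 2: copy the string bytes into one attached buffer and
    rewrite each view as an (offset, length) pair into that buffer."""
    out = bytearray()
    offsets = []
    for v in views:
        offsets.append((len(out), v.length))
        out += pool[v.chunk_id][v.offset:v.offset + v.length]
    return bytes(out), offsets


# A toy pool with three chunks; this vector only touches two of them.
pool = {0: b"hello world", 1: b"arrow", 2: b"umbra duckdb velox"}
views = [View(2, 6, 6), View(0, 0, 5), View(2, 13, 5)]

chunks = referenced_chunks(views)   # bookkeeping only, no bytes copied
packed, offsets = repack(views, pool)  # copies every string byte
```

In the first case the producer has to track and export a chunk list per
vector, which grows with the number of pool chunks touched. In the
second case every exported string is copied, which is the "repack" cost
mentioned above.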

You can see a discussion of how Umbra has three different storage
tiers (persistent, transient, temporary) for out-of-line strings in
this paper:

https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

It might be a good idea to look more carefully at how DuckDB and Velox
do memory management for the out-of-line strings.

If we start placing restrictions on how the out-of-line string buffers
are managed and externalized, we risk undermining the zero-copy
interoperability benefits that we're trying to achieve with this.
