Hi folks,

I'd like to start work on implementing the "StringView" memory
layout that we discussed late last year [1] (see the supporting
document [2]).

Since there are quite a few details to work out, my objective would be
to do the work in a feature branch focused on a few things:

* Establishing more efficient interoperability (zero-copy, in some
cases) with this memory layout between DuckDB, Velox, and Arrow
Datasets / Acero via the C ABI
* Implementing support for reading StringViewArrays from Parquet files
(which should be faster when materializing string data that benefited
greatly from dictionary encoding)
* Implementing some initial kernels that benefit from the StringView type
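
To illustrate the Parquet point in the list above, here is a minimal
C++ sketch of why a view-based layout can help with dictionary-encoded
data. It uses std::string_view as a stand-in for the proposed view
type, and MaterializeViews is a hypothetical helper, not an Arrow or
Parquet API. Each output value is a small fixed-size view into the
dictionary's bytes, so no string data is copied per row.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not an Arrow/Parquet API): decode dictionary
// indices into views. Each output element references the bytes of a
// dictionary entry instead of copying them, so the dictionary must
// outlive the returned views.
std::vector<std::string_view> MaterializeViews(
    const std::vector<std::string>& dictionary,
    const std::vector<int32_t>& indices) {
  std::vector<std::string_view> views;
  views.reserve(indices.size());
  for (int32_t idx : indices) {
    // at() throws std::out_of_range on an invalid index.
    const std::string& entry = dictionary.at(static_cast<std::size_t>(idx));
    views.push_back(std::string_view(entry));
  }
  return views;
}
```

With an offsets-based string array, the same decode step would have to
append the bytes of every value to a contiguous data buffer.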

Of course, to formally accept these changes into the Arrow format,
we'll need to have two reference implementations along with
integration tests. I outlined what I think this process might look
like along with a rough C++ implementation plan:

https://docs.google.com/document/d/1kocVHzEpd-veq2AsoHcsrlaDpezUNOKk2PRzggwAs9w/edit#

I understand there are still some aspects of this project that cause
some squeamishness (like having arbitrary memory addresses embedded
within array values whose lifetime a C ABI consumer may not know about
-- we already export memory addresses in the C ABI but fewer of them
because they are only the buffers at the array level). We discussed
some alternative approaches that address some of these questions, but
each comes with its own trade-offs.
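
For concreteness, here is a rough sketch of where those embedded
addresses live. It assumes the 16-byte layout used by systems like
Velox and DuckDB (a 4-byte length, then either up to 12 inline bytes
or a 4-byte prefix plus an 8-byte pointer) on a 64-bit platform. The
struct and function names are hypothetical, and the exact layout Arrow
would adopt is one of the open questions.

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical 16-byte view value (names are illustrative only).
struct StringViewSketch {
  uint32_t size;
  // size <= 12: all 12 bytes hold the string data inline.
  // size  > 12: bytes[0..4) hold a prefix, bytes[4..12) hold a raw pointer.
  char bytes[12];
};
static_assert(sizeof(StringViewSketch) == 16);
static_assert(sizeof(const char*) == 8);  // assumption: 64-bit pointers

constexpr uint32_t kInlineSize = 12;

bool IsInline(const StringViewSketch& v) { return v.size <= kInlineSize; }

// Builds a view over `data`. For long strings the view stores the
// address of `data`, so that memory must outlive the view.
StringViewSketch MakeView(const char* data, uint32_t size) {
  StringViewSketch v{};  // zero-initialize, including unused inline bytes
  v.size = size;
  if (size <= kInlineSize) {
    std::memcpy(v.bytes, data, size);
  } else {
    std::memcpy(v.bytes, data, 4);                  // 4-byte prefix
    std::memcpy(v.bytes + 4, &data, sizeof(data));  // embedded address
  }
  return v;
}

// Returns the string a view refers to. For inline values the result
// points into `v` itself, so `v` must stay alive while it is used.
std::string_view GetView(const StringViewSketch& v) {
  if (IsInline(v)) return std::string_view(v.bytes, v.size);
  const char* data = nullptr;
  std::memcpy(&data, v.bytes + 4, sizeof(data));
  return std::string_view(data, v.size);
}
```

The non-inline branch is where the lifetime question arises: the
address is stored in each value rather than in an array-level buffer,
so a C ABI consumer has to know that the memory it points to stays
valid for as long as the array does.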

I'm hopeful that having some concrete implementation work will help us
look more precisely at some of the open questions around the design
and implementation and enable us to arrive at a satisfactory
consensus.

Thanks,
Wes

[1]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[2]: https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
