Very cool! In addition to the performance benefits mentioned above, I could see this being useful for the R bindings - we already have a global string pool and a mechanism for keeping a vector of them alive.
I don't see the C Data interface in the PR although I may have missed it - is that a part of the proposal? It seems like it would be possible to use raw pointers as long as they can be guaranteed to be valid until the release callback is called?

On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak <ja...@voltrondata.com.invalid> wrote:
>
> Hello Everyone,
> I think keeping interoperability with the large ecosystem is a very
> important goal for arrow so I am overall in favor of this proposal!
>
> You mention benchmarks multiple times, are these results published
> somewhere?
>
> Thanks
>
> On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <bengil...@gmail.com>
> wrote:
>
> > Hello all,
> >
> > As previously discussed on this list [1], an UmbraDB/DuckDB/Velox
> > compatible "string view" type could bring several performance benefits
> > to access and authoring of string data in the arrow format [2].
> > Additionally better interoperability with engines already using this
> > format could be established.
> >
> > PR #0 [3] adds Utf8View and BinaryView types to the C++ implementation
> > and to the IPC format. For the purposes of IPC raw pointers are not
> > used. Instead, each view contains a pair of 32 bit unsigned integers
> > which encode the index of a character buffer (string view arrays may
> > consist of a variable number of such buffers) and the offset of a
> > view's data within that buffer respectively.
> > Benefits of this substitution include:
> > - This makes explicit the guarantee that lifetime of all character data
> >   is equal to that of the array which views it, which is critical for
> >   confident consumption across an interface boundary.
> > - As with other types in the arrow format, such arrays are serializable
> >   and venue agnostic; directly usable in shared memory without
> >   modification.
> > - Indices and offsets are easily validated.
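
To make the index/offset encoding described above concrete, here is a minimal C++17 sketch. It is not the code from the PR: the names `IndexOffsetView`, `IsValid`, and `Resolve` are hypothetical, character buffers are modeled as `std::string`, and the view is simplified to a size plus the (buffer index, offset) pair, leaving out any inlining of short strings.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified view: a length plus the pair of 32 bit unsigned
// integers (buffer index, offset) described in the proposal.
struct IndexOffsetView {
  uint32_t size;          // length of the string in bytes
  uint32_t buffer_index;  // which character buffer holds the bytes
  uint32_t offset;        // where the bytes start within that buffer
};

// Validation needs only bounds checks; no pointer is dereferenced.
inline bool IsValid(const IndexOffsetView& v,
                    const std::vector<std::string>& buffers) {
  if (v.buffer_index >= buffers.size()) return false;
  // Widen to 64 bits so offset + size cannot wrap around.
  const uint64_t end = static_cast<uint64_t>(v.offset) + v.size;
  return end <= buffers[v.buffer_index].size();
}

// Resolving a view is base pointer plus offset.
inline std::optional<std::string_view> Resolve(
    const IndexOffsetView& v, const std::vector<std::string>& buffers) {
  if (!IsValid(v, buffers)) return std::nullopt;
  return std::string_view(buffers[v.buffer_index].data() + v.offset, v.size);
}
```

Because a view stores no address, the same bytes stay meaningful after the buffers are copied, serialized, or mapped at a different location, which is the property the bullet points above rely on.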
> >
> > Accessing the data requires some trivial pointer arithmetic, but in
> > benchmarking this had negligible impact on sequential access and only
> > minor impact on random access.
> >
> > In the C++ implementation, raw pointer string views are supported as an
> > extended case of the Utf8View type:
> > `utf8_view(/*has_raw_pointers=*/true)`. Branching on this access pattern
> > bit at the data type level has negligible impact on performance since
> > the branch resides outside any hot loops. Utility functions are provided
> > for efficient (potentially in-place) conversion between raw pointer and
> > index offset views. For example, the C++ implementation could zero copy
> > a raw pointer array from Velox, filter it, then convert to index/offset
> > for serialization. Other implementations may choose to accommodate or
> > eschew raw pointer views as their communities direct.
> >
> > Where desirous in a rigorously controlled context this still enables
> > construction and safe consumption of string view arrays which reference
> > memory not directly bound to the lifetime of the array. I'm not sure
> > when or if we would find it useful to have arrays like this; I do not
> > introduce any in [3]. I mention this possibility to highlight that if
> > benchmarking demonstrates that such an approach brings a significant
> > performance benefit to some operation, the only barrier to its adoption
> > would be code review. Likewise if more intensive benchmarking determines
> > that raw pointer views critically outperform index/offset views for
> > real-world analytics tasks, prioritizing raw pointer string views for
> > usage within the C++ implementation will be straightforward.
> >
> > See also the proposal to Velox that their string view vector be
> > refactored in a similar vein [4].
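
The conversion between raw pointer and index/offset views mentioned above could look roughly like the following C++17 sketch. This is an illustration under assumptions, not the utility functions from the PR: `RawPointerView`, `IndexOffsetView`, `ToIndexOffset`, and `ToRawPointer` are hypothetical names, buffers are modeled as `std::string`, and a linear search over the buffers stands in for whatever lookup a real implementation would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical simplified views; inlining of short strings is omitted.
struct RawPointerView {
  uint32_t size;
  const char* data;  // only valid while the owning buffer is alive
};

struct IndexOffsetView {
  uint32_t size;
  uint32_t buffer_index;
  uint32_t offset;
};

// Raw pointer -> index/offset: find the buffer that contains the pointer.
// Returns nullopt if the bytes are not owned by any of the given buffers.
inline std::optional<IndexOffsetView> ToIndexOffset(
    const RawPointerView& v, const std::vector<std::string>& buffers) {
  std::less<const char*> lt;  // total order, even across unrelated objects
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const char* begin = buffers[i].data();
    const char* end = begin + buffers[i].size();
    if (lt(v.data, begin) || lt(end, v.data)) continue;  // outside buffer i
    if (static_cast<std::size_t>(end - v.data) < v.size) continue;
    return IndexOffsetView{v.size, static_cast<uint32_t>(i),
                           static_cast<uint32_t>(v.data - begin)};
  }
  return std::nullopt;
}

// Index/offset -> raw pointer. Assumes the offset was already validated;
// only the buffer index is checked here (at() throws if out of range).
inline RawPointerView ToRawPointer(const IndexOffsetView& v,
                                   const std::vector<std::string>& buffers) {
  return RawPointerView{v.size, buffers.at(v.buffer_index).data() + v.offset};
}
```

In this sketch the raw pointer form is only meaningful while `buffers` is alive and unmodified, which is the same lifetime question as whether raw pointers can be guaranteed valid until a release callback runs.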
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > [2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
> > [3] https://github.com/apache/arrow/pull/35628
> > [4] https://github.com/facebookincubator/velox/discussions/4362