I believe the motivation is to avoid the cost of the data copy that would be needed to convert from the pointer-based to the offset-based representation. Allowing the pointer-based implementation ensures that we can maintain zero-copy communication with both DuckDB and Velox in a common workflow scenario.
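To make the copy concrete, here is a minimal sketch of what a pointer-to-offset conversion has to do. The struct and function names (PointerView, OffsetView, ToOffsetViews) are hypothetical and simplified for illustration; they are not the actual Arrow, DuckDB, or Velox types. The sketch also assumes a single destination buffer and ignores the inlining of short strings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified view layouts (illustration only).
struct PointerView {
  int32_t length;
  const char* data;  // raw pointer to bytes owned elsewhere
};

struct OffsetView {
  int32_t length;
  int32_t buffer_index;  // which data buffer holds the bytes
  int32_t offset;        // byte offset inside that buffer
};

// Convert pointer views to offset views. Every string's bytes are copied
// into one contiguous buffer, so the cost grows with the total byte count.
std::vector<OffsetView> ToOffsetViews(const std::vector<PointerView>& in,
                                      std::string* out_buffer) {
  std::vector<OffsetView> out;
  out.reserve(in.size());
  for (const PointerView& v : in) {
    assert(v.length >= 0);
    const int32_t offset = static_cast<int32_t>(out_buffer->size());
    // This append is the data copy that the zero-copy path avoids.
    out_buffer->append(v.data, static_cast<std::size_t>(v.length));
    out.push_back(OffsetView{v.length, /*buffer_index=*/0, offset});
  }
  return out;
}
```

In this sketch the views themselves are cheap to rewrite; the expense is the append, which touches every byte of every string.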
Converting to the offset-based version would mean copying strings from their current locations into contiguous buffers, a cost that could be very significant depending on the shape and size of the data. The pointer-based solution wouldn't be allowed in IPC though, only across the C Data Interface (correct me if I'm wrong).

--Matt

On Tue, Sep 26, 2023, 6:09 PM Raphael Taylor-Davies <r.taylordav...@googlemail.com.invalid> wrote:

> Hi,
>
> Is the motivation here to avoid DuckDB and Velox having to duplicate the
> conversion logic from pointer-based to offset-based, or to allow
> arrow-cpp to operate directly on pointer-based arrays?
>
> If it is the former, I personally wouldn't have thought the conversion
> logic sufficiently complex to really warrant this?
>
> If it is the latter, I wonder if you have some benchmark numbers for
> converting between and operating on the differing representations? In
> the absence of a strong performance case, it's hard in my opinion to
> justify adding what will be an arrow-cpp specific extension that isn't
> part of the standard, with all the potential for confusion and
> interoperability challenges that entails.
>
> Kind Regards,
>
> Raphael
>
> On 26/09/2023 21:34, Benjamin Kietzman wrote:
> > Hello all,
> >
> > In the PR to add support for Utf8View to the c++ implementation,
> > I've taken the approach of allowing raw pointer views [1] alongside the
> > index/offset views described in the spec [2]. This was done to ease
> > communication with other engines such as DuckDB and Velox whose native
> > string representation is the raw pointer view. In order to be usable
> > as a utility for writing IPC files and other operations on arrow
> > formatted data, it is useful for the library to be able to directly
> > import raw pointer arrays even when immediately converting these to
> > the index/offset representation.
> >
> > However there has been objection in review [3] since the raw pointer
> > representation is not part of the official format. Since data visitation
> > utilities are generic, IMHO this hybrid approach does not add
> > significantly to the complexity of the C++ library, and I feel the
> > aforementioned interoperability is a high priority when adding this
> > feature to the C++ library. It's worth noting that this interoperability
> > has been a stated goal of the Utf8Type since its original proposal [4]
> > and throughout the discussion of its adoption [5].
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1]:
> > https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
> > [2]:
> > https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
> > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4