I believe the motivation is to avoid the cost of the data copy that would
have to happen to convert from a pointer based to offset based scenario.
Allowing the pointer-based implementation will ensure that we can maintain
zero-copy communication with both DuckDB and Velox in a common workflow
scenario.

Converting to the offset-based version would have a cost of having to copy
strings from their locations to contiguous buffers which could end up being
very significant depending on the shape and size of the data. The pointer
-based solution wouldn't be allowed in IPC though, only across the C Data
interface (correct me if I'm wrong).

--Matt

On Tue, Sep 26, 2023, 6:09 PM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> Hi,
>
> Is the motivation here to avoid DuckDB and Velox having to duplicate the
> conversion logic from pointer-based to offset-based, or to allow
> arrow-cpp to operate directly on pointer-based arrays?
>
> If it is the former, I personally wouldn't have thought the conversion
> logic sufficiently complex to really warrant this?
>
> If it is the latter, I wonder if you have some benchmark numbers for
> converting between and operating on the differing representations? In
> the absence of a strong performance case, it's hard in my opinion to
> justify adding what will be an arrow-cpp specific extension that isn't
> part of the standard, with all the potential for confusion and
> interoperability challenges that entails.
>
> Kind Regards,
>
> Raphael
>
> On 26/09/2023 21:34, Benjamin Kietzman wrote:
> > Hello all,
> >
> > In the PR to add support for Utf8View to the c++ implementation,
> > I've taken the approach of allowing raw pointer views [1] alongside the
> > index/offset views described in the spec [2]. This was done to ease
> > communication with other engines such as DuckDB and Velox whose native
> > string representation is the raw pointer view. In order to be usable
> > as a utility for writing IPC files and other operations on arrow
> > formatted data, it is useful for the library to be able to directly
> > import raw pointer arrays even when immediately converting these to
> > the index/offset representation.
> >
> > However there has been objection in review [3] since the raw pointer
> > representation is not part of the official format. Since data visitation
> > utilities are generic, IMHO this hybrid approach does not add
> > significantly to the complexity of the C++ library, and I feel the
> > aforementioned interoperability is a high priority when adding this
> > feature to the C++ library. It's worth noting that this interoperability
> > has been a stated goal of the Utf8Type since its original proposal [4]
> > and throughout the discussion of its adoption [5].
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1]:
> >
> https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
> > [2]:
> >
> https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
> > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
> >
>

Reply via email to