Hello all,

@Gang
> Could you please simply describe the layout of DuckDB and Velox

Arrow represents long (>12 bytes) strings with a view which includes
a buffer index (used to look up one of the variadic data buffers)
and an offset (used to find the start of a string's bytes within the
indicated buffer). DuckDB and Velox by contrast have a raw pointer
directly to the start of the string's bytes. Since these occupy the
same 8 bytes of a view, it's possible and fairly efficient to convert
from one representation to the other by modifying those 8 bytes in place.
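To make the layout concrete, here is a minimal sketch of the in-place
conversion, assuming 64-bit pointers. The names View, DataBuffer,
ToRawPointer and ToIndexOffset are hypothetical and are not the API in the
PR; the linear search over buffers is only for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 16-byte view; field names are illustrative, not Arrow C++ API.
struct View {
  int32_t length;        // string length in bytes
  char prefix[4];        // first four bytes of a long string
  unsigned char ref[8];  // {int32 buffer index, int32 offset} OR a raw pointer
};
static_assert(sizeof(View) == 16, "a view occupies 16 bytes");
static_assert(sizeof(const char*) == 8, "sketch assumes 64-bit pointers");

// Strings of at most 12 bytes are stored inline and need no conversion.
constexpr int32_t kInlineSize = 12;

// Hypothetical description of one variadic data buffer.
struct DataBuffer {
  const char* data;
  int64_t size;
};

// Rewrite an index/offset view into a raw pointer view, in place.
void ToRawPointer(View& v, const std::vector<DataBuffer>& buffers) {
  if (v.length <= kInlineSize) return;
  int32_t index, offset;
  std::memcpy(&index, v.ref, sizeof(index));
  std::memcpy(&offset, v.ref + sizeof(index), sizeof(offset));
  assert(index >= 0 && static_cast<std::size_t>(index) < buffers.size());
  const char* ptr = buffers[index].data + offset;
  std::memcpy(v.ref, &ptr, sizeof(ptr));
}

// Rewrite a raw pointer view into an index/offset view, in place.
// Returns false if the pointed-to bytes lie in none of the buffers.
bool ToIndexOffset(View& v, const std::vector<DataBuffer>& buffers) {
  if (v.length <= kInlineSize) return true;
  const char* ptr;
  std::memcpy(&ptr, v.ref, sizeof(ptr));
  const auto p = reinterpret_cast<std::uintptr_t>(ptr);
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const auto begin = reinterpret_cast<std::uintptr_t>(buffers[i].data);
    const auto end = begin + static_cast<std::uintptr_t>(buffers[i].size);
    if (p >= begin && p + static_cast<std::uintptr_t>(v.length) <= end) {
      // Assumes the offset fits in 32 bits (buffers under 2 GiB).
      const auto index = static_cast<int32_t>(i);
      const auto offset = static_cast<int32_t>(p - begin);
      std::memcpy(v.ref, &index, sizeof(index));
      std::memcpy(v.ref + sizeof(index), &offset, sizeof(offset));
      return true;
    }
  }
  return false;
}
```

Only the last 8 bytes of a long view change; the length and prefix are
shared by both representations, and inline strings are left untouched.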

@Raphael
> Is the motivation here to avoid DuckDB and Velox having to duplicate the
> conversion logic from pointer-based to offset-based, or to allow
> arrow-cpp to operate directly on pointer-based arrays?

It's more the latter: Arrow C++ is intended to be useful as more than an IPC
serializer/deserializer, so it is beneficial to be able to import arrays
and also operate on them with no conversion cost. However, it's also worth
noting that the raw pointer representation is more efficient on access,
albeit more expensive to validate, along with a number of other tradeoffs.
To move this work forward, I took this hybrid approach in part to defer
the question of which representation is preferred in which context. I would
like to allow the C++ library freedom to extract as much performance from
this type as possible, internally as well as when communicating with other
engines.

@Antoine
> What this PR is creating is an "unofficial" Arrow format, with data
> types exposed in Arrow C++ that are not part of the Arrow standard, but
> are exposed as if they were.

We already do this in every implementation of the Arrow format I'm
aware of: it's more convenient to consider dictionary as a data type
even though the spec says that it is a field property. I don't think
it's illegal or unreasonable for an implementation to diverge in its
internal handling of Arrow data (whether to achieve performance,
consistency, or convenience).

> I'm not sure how DuckDB and Velox data could be exposed, but it could be
> for example an extension type with a fixed_size_binary<16> storage type.

This wouldn't allow for the transmission of the variadic data buffers
which (even in the presence of raw pointer views) are necessary to
guarantee the lifetime of string data in the vector. Alternatively we
could use Utf8View with the high and low bits of the raw pointer
packed into the index and offset, but I don't think this would be any
less of an unofficial Arrow format.
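For illustration, the packing alternative could look roughly like the
sketch below, again assuming 64-bit pointers. IndexOffset, PackPointer and
UnpackPointer are hypothetical names, not anything in the PR or in Arrow C++.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(const char*) == 8, "sketch assumes 64-bit pointers");

// Hypothetical stand-in for the last 8 bytes of a spec-conformant view.
struct IndexOffset {
  int32_t buffer_index;
  int32_t offset;
};

// Store the high 32 bits of the pointer in the index field and the low
// 32 bits in the offset field.
IndexOffset PackPointer(const char* ptr) {
  const uint64_t bits = reinterpret_cast<std::uintptr_t>(ptr);
  const uint32_t high = static_cast<uint32_t>(bits >> 32);
  const uint32_t low = static_cast<uint32_t>(bits & 0xFFFFFFFFu);
  IndexOffset packed;
  std::memcpy(&packed.buffer_index, &high, sizeof(high));
  std::memcpy(&packed.offset, &low, sizeof(low));
  return packed;
}

// Reassemble the pointer from the two 32-bit halves.
const char* UnpackPointer(const IndexOffset& packed) {
  uint32_t high, low;
  std::memcpy(&high, &packed.buffer_index, sizeof(high));
  std::memcpy(&low, &packed.offset, sizeof(low));
  const uint64_t bits = (static_cast<uint64_t>(high) << 32) | low;
  return reinterpret_cast<const char*>(static_cast<std::uintptr_t>(bits));
}
```

A consumer reading such a view according to the spec would treat the two
halves as a real buffer index and offset, so the data would be just as
non-standard as an explicit raw pointer view.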

Sincerely,
Ben Kietzman


On Wed, Sep 27, 2023 at 2:51 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello,
>
> What this PR is creating is an "unofficial" Arrow format, with data
> types exposed in Arrow C++ that are not part of the Arrow standard, but
> are exposed as if they were. Most users will probably not read the
> official format spec, but will simply trust the official Arrow
> implementations. So the official Arrow implementations have an
> obligation to faithfully represent the Arrow format and not breed
> confusion.
>
> So I'm -1 on the way the PR presents things currently.
>
> I'm not sure how DuckDB and Velox data could be exposed, but it could be
> for example an extension type with a fixed_size_binary<16> storage type.
>
> Regards
>
> Antoine.
>
>
>
> Le 26/09/2023 à 22:34, Benjamin Kietzman a écrit :
> > Hello all,
> >
> > In the PR to add support for Utf8View to the c++ implementation,
> > I've taken the approach of allowing raw pointer views [1] alongside the
> > index/offset views described in the spec [2]. This was done to ease
> > communication with other engines such as DuckDB and Velox whose native
> > string representation is the raw pointer view. In order to be usable
> > as a utility for writing IPC files and other operations on arrow
> > formatted data, it is useful for the library to be able to directly
> > import raw pointer arrays even when immediately converting these to
> > the index/offset representation.
> >
> > However there has been objection in review [3] since the raw pointer
> > representation is not part of the official format. Since data visitation
> > utilities are generic, IMHO this hybrid approach does not add
> > significantly to the complexity of the C++ library, and I feel the
> > aforementioned interoperability is a high priority when adding this
> > feature to the C++ library. It's worth noting that this interoperability
> > has been a stated goal of the Utf8Type since its original proposal [4]
> > and throughout the discussion of its adoption [5].
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1]:
> >
> https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
> > [2]:
> >
> https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
> > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
> >
>
