Hello,
What this PR is creating is an "unofficial" Arrow format, with data
types exposed in Arrow C++ that are not part of the Arrow standard, but
are exposed as if they were. Most users will probably not read the
official format spec, but will simply trust the official Arrow
implementations. So the official Arrow implementations have an
obligation to faithfully represent the Arrow format and not breed confusion.
So I'm -1 on the way the PR presents things currently.
I'm not sure how DuckDB and Velox data could be exposed, but it could be
for example an extension type with a fixed_size_binary<16> storage type.
Regards
Antoine.
Le 26/09/2023 à 22:34, Benjamin Kietzman a écrit :
Hello all,
In the PR to add support for Utf8View to the c++ implementation,
I've taken the approach of allowing raw pointer views [1] alongside the
index/offset views described in the spec [2]. This was done to ease
communication with other engines such as DuckDB and Velox whose native
string representation is the raw pointer view. In order to be usable
as a utility for writing IPC files and other operations on arrow
formatted data, it is useful for the library to be able to directly
import raw pointer arrays even when immediately converting these to
the index/offset representation.
However there has been objection in review [3] since the raw pointer
representation is not part of the official format. Since data visitation
utilities are generic, IMHO this hybrid approach does not add
significantly to the complexity of the C++ library, and I feel the
aforementioned interoperability is a high priority when adding this
feature to the C++ library. It's worth noting that this interoperability
has been a stated goal of the Utf8Type since its original proposal [4]
and throughout the discussion of its adoption [5].
Sincerely,
Ben Kietzman
[1]:
https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
[2]:
https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
[3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
[4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4