Hello all,

@Gang
> Could you please simply describe the layout of DuckDB and Velox

Arrow represents long (>12 bytes) strings with a view which includes a buffer index (used to look up one of the variadic data buffers) and an offset (used to find the start of a string's bytes within the indicated buffer). DuckDB and Velox by contrast have a raw pointer directly to the start of the string's bytes. Since these occupy the same 8 bytes of a view, it's possible and fairly efficient to convert from one representation to the other by modifying those 8 bytes in place.

@Raphael
> Is the motivation here to avoid DuckDB and Velox having to duplicate the conversion logic from pointer-based to offset-based, or to allow arrow-cpp to operate directly on pointer-based arrays?

It's more the latter; arrow C++ is intended to be useful as more than an IPC serializer/deserializer, so it is beneficial to be able to import arrays and also operate on them with no conversion cost. However it's also worth noting that the raw pointer representation is more efficient on access, albeit more expensive to validate along with a number of other tradeoffs. In order to progress this work, I took this hybrid approach in part to defer the question of which representation is preferred in which context. I would like to allow the C++ library freedom to extract as much performance from this type as possible, internally as well as when communicating with other engines.

@Antoine
> What this PR is creating is an "unofficial" Arrow format, with data types exposed in Arrow C++ that are not part of the Arrow standard, but are exposed as if they were.

We already do this in every implementation of the arrow format I'm aware of: it's more convenient to consider dictionary as a data type even though the spec says that it is a field property. I don't think it's illegal or unreasonable for an implementation to diverge in their internal handling of arrow data (whether to achieve performance, consistency, or convenience).

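To make the layout described in the reply to Gang concrete, here is a rough sketch of the two long-string representations and the in-place rewrite of the shared 8 bytes. It is only an illustration, not the code in the PR: `View`, `Buffers`, `MakeIndexOffsetView`, `IndexOffsetToRawPointer`, `RawPointerToIndexOffset`, `ReadIndexOffset` and `ReadRawPointer` are hypothetical names, `std::string` stands in for the variadic data buffers, and 64-bit pointers are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not Arrow's, DuckDB's or Velox's API.
constexpr std::int32_t kInlineSize = 12;
static_assert(sizeof(const char*) == 8, "sketch assumes 64-bit pointers");

// 16-byte view. Strings of <= 12 bytes are stored inline in `rest`.
// Longer strings keep a 4-byte prefix in rest[0..4), and rest[4..12) holds
// either {int32 buffer index, int32 offset} or a raw pointer.
struct View {
  std::int32_t length;
  unsigned char rest[12];
};
static_assert(sizeof(View) == 16, "a view is 16 bytes");

using Buffers = std::vector<std::string>;  // stand-in for variadic data buffers

View MakeIndexOffsetView(const Buffers& buffers, std::int32_t index,
                         std::int32_t offset, std::int32_t length) {
  View v{};
  v.length = length;
  const char* data = buffers[index].data() + offset;
  if (length <= kInlineSize) {
    std::memcpy(v.rest, data, static_cast<std::size_t>(length));
  } else {
    std::memcpy(v.rest, data, 4);         // prefix
    std::memcpy(v.rest + 4, &index, 4);   // buffer index
    std::memcpy(v.rest + 8, &offset, 4);  // offset within that buffer
  }
  return v;
}

// Overwrite the index/offset pair with a pointer to the string's first byte.
void IndexOffsetToRawPointer(View* v, const Buffers& buffers) {
  if (v->length <= kInlineSize) return;  // inline views carry no reference
  std::int32_t index, offset;
  std::memcpy(&index, v->rest + 4, 4);
  std::memcpy(&offset, v->rest + 8, 4);
  const char* ptr = buffers[index].data() + offset;
  std::memcpy(v->rest + 4, &ptr, 8);
}

// Reverse direction: find the buffer that contains the pointed-to bytes.
// Returns false if no buffer contains them.
bool RawPointerToIndexOffset(View* v, const Buffers& buffers) {
  if (v->length <= kInlineSize) return true;
  const char* ptr;
  std::memcpy(&ptr, v->rest + 4, 8);
  const auto p = reinterpret_cast<std::uintptr_t>(ptr);
  const auto len = static_cast<std::uintptr_t>(v->length);
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const auto begin = reinterpret_cast<std::uintptr_t>(buffers[i].data());
    if (p >= begin && p + len <= begin + buffers[i].size()) {
      const auto index = static_cast<std::int32_t>(i);
      const auto offset = static_cast<std::int32_t>(p - begin);
      std::memcpy(v->rest + 4, &index, 4);
      std::memcpy(v->rest + 8, &offset, 4);
      return true;
    }
  }
  return false;
}

// Readers copy into std::string for simplicity; a real reader would
// presumably return a non-owning view instead.
std::string ReadIndexOffset(const View& v, const Buffers& buffers) {
  const auto n = static_cast<std::size_t>(v.length);
  if (v.length <= kInlineSize) {
    return std::string(reinterpret_cast<const char*>(v.rest), n);
  }
  std::int32_t index, offset;
  std::memcpy(&index, v.rest + 4, 4);
  std::memcpy(&offset, v.rest + 8, 4);
  return buffers[index].substr(static_cast<std::size_t>(offset), n);
}

std::string ReadRawPointer(const View& v) {
  const auto n = static_cast<std::size_t>(v.length);
  if (v.length <= kInlineSize) {
    return std::string(reinterpret_cast<const char*>(v.rest), n);
  }
  const char* ptr;
  std::memcpy(&ptr, v.rest + 4, 8);
  return std::string(ptr, n);
}
```

In this sketch the raw pointer reader needs no buffer lookup, while converting back to index/offset has to search for the owning buffer. The buffers must outlive any raw pointer view in either case.
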
> I'm not sure how DuckDB and Velox data could be exposed, but it could be for example an extension type with a fixed_size_binary<16> storage type.

This wouldn't allow for the transmission of the variadic data buffers which (even in the presence of raw pointer views) are necessary to guarantee the lifetime of string data in the vector. Alternatively we could use Utf8View with the high and low bits of the raw pointer packed into the index and offset, but I don't think this would be any less of an unofficial arrow format.

Sincerely,
Ben Kietzman

On Wed, Sep 27, 2023 at 2:51 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello,
>
> What this PR is creating is an "unofficial" Arrow format, with data
> types exposed in Arrow C++ that are not part of the Arrow standard, but
> are exposed as if they were. Most users will probably not read the
> official format spec, but will simply trust the official Arrow
> implementations. So the official Arrow implementations have an
> obligation to faithfully represent the Arrow format and not breed
> confusion.
>
> So I'm -1 on the way the PR presents things currently.
>
> I'm not sure how DuckDB and Velox data could be exposed, but it could be
> for example an extension type with a fixed_size_binary<16> storage type.
>
> Regards
>
> Antoine.
>
>
> On 26/09/2023 at 22:34, Benjamin Kietzman wrote:
> > Hello all,
> >
> > In the PR to add support for Utf8View to the c++ implementation,
> > I've taken the approach of allowing raw pointer views [1] alongside the
> > index/offset views described in the spec [2]. This was done to ease
> > communication with other engines such as DuckDB and Velox whose native
> > string representation is the raw pointer view. In order to be usable
> > as a utility for writing IPC files and other operations on arrow
> > formatted data, it is useful for the library to be able to directly
> > import raw pointer arrays even when immediately converting these to
> > the index/offset representation.
> >
> > However there has been objection in review [3] since the raw pointer
> > representation is not part of the official format. Since data visitation
> > utilities are generic, IMHO this hybrid approach does not add
> > significantly to the complexity of the C++ library, and I feel the
> > aforementioned interoperability is a high priority when adding this
> > feature to the C++ library. It's worth noting that this interoperability
> > has been a stated goal of the Utf8Type since its original proposal [4]
> > and throughout the discussion of its adoption [5].
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1]: https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
> > [2]: https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
> > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4