To be clear, each of the factory functions listed below creates a
type that maps exactly onto an Arrow columnar layout:
https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions
For example, calling `arrow::dictionary` creates a dictionary type that
exactly represents the dictionary layout specified in
https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout
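As a rough illustration of what that layout amounts to, here is a minimal sketch in plain C++ (standard library only, not the Arrow API). The names `DictionaryEncoded` and `DictionaryEncode` are hypothetical, and null handling and validity bitmaps are left out. It only shows the core idea: integer indices that refer to a dictionary of distinct values.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative model only: this is NOT Arrow's API, and nulls are ignored.
struct DictionaryEncoded {
  std::vector<std::int32_t> indices;    // one index per logical value
  std::vector<std::string> dictionary;  // each distinct value stored once
};

// Encode values in order of first appearance.
DictionaryEncoded DictionaryEncode(const std::vector<std::string>& values) {
  DictionaryEncoded out;
  std::unordered_map<std::string, std::int32_t> seen;
  for (const auto& value : values) {
    auto it = seen.find(value);
    if (it == seen.end()) {
      // First occurrence: append to the dictionary and remember its index.
      const auto index = static_cast<std::int32_t>(out.dictionary.size());
      it = seen.emplace(value, index).first;
      out.dictionary.push_back(value);
    }
    out.indices.push_back(it->second);
  }
  return out;
}
```

Under these assumptions, encoding `{"foo", "bar", "foo", "baz", "bar"}` yields the dictionary `{"foo", "bar", "baz"}` and the indices `{0, 1, 0, 2, 1}`.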
Similarly, if you use any of the builders listed below, the result is
data that complies with the Arrow columnar specification:
https://arrow.apache.org/docs/dev/cpp/api/builder.html
All the core Arrow C++ APIs create and process data which complies with
the Arrow specification, and which is interoperable with other Arrow
implementations.
Conversely, non-Arrow data such as CSV or Parquet (or Python lists,
etc.) goes through dedicated converters. There is no ambiguity.
Creating top-level utilities that create non-Arrow data introduces
confusion and ambiguity as to what Arrow is. Users who haven't studied
the spec in detail - which is probably most users of Arrow
implementations - will call `arrow::string_view(raw_pointers=true)` and
might later discover that their data cannot be shared with other
implementations (or, if it can, there will be an unsuspected conversion
cost at the edge).
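To make the "conversion cost at the edge" concrete, here is a hedged sketch in plain C++ (not Arrow code). The structs `RawPointerView`, `IndexOffsetView` and `Buffer` and the function `ConvertViews` are hypothetical simplifications: short-string inlining and prefixes are omitted, and the index/offset shape is only assumed to resemble a shareable representation. The point is that every raw pointer has to be resolved against the set of known buffers before the data can leave the process.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical, simplified view shapes (no inlined short strings, no prefix).
struct RawPointerView {
  std::int32_t length;
  const char* data;  // only meaningful inside the current process
};

struct IndexOffsetView {
  std::int32_t length;
  std::int32_t buffer_index;  // which data buffer holds the bytes
  std::int32_t offset;        // byte offset inside that buffer
};

struct Buffer {
  const char* data;
  std::size_t size;
};

// Rewrites every raw pointer as (buffer index, offset).
// Returns std::nullopt if a view points outside all known buffers.
// Cost is O(views * buffers) in this naive version.
std::optional<std::vector<IndexOffsetView>> ConvertViews(
    const std::vector<RawPointerView>& views,
    const std::vector<Buffer>& buffers) {
  std::less_equal<const char*> le;  // total order, even across unrelated arrays
  std::vector<IndexOffsetView> out;
  out.reserve(views.size());
  for (const auto& view : views) {
    bool found = false;
    for (std::size_t i = 0; i < buffers.size(); ++i) {
      const char* begin = buffers[i].data;
      const char* end = begin + buffers[i].size;
      // First check that the start lies inside the buffer, then that the
      // whole view fits before the end.
      if (le(begin, view.data) && le(view.data, end) &&
          static_cast<std::ptrdiff_t>(view.length) <= end - view.data) {
        out.push_back({view.length, static_cast<std::int32_t>(i),
                       static_cast<std::int32_t>(view.data - begin)});
        found = true;
        break;
      }
    }
    if (!found) return std::nullopt;
  }
  return out;
}
```

Even in this toy form, the conversion touches every view and needs knowledge of all backing buffers, which is the kind of work a user would not expect when they believed their data was already in Arrow format.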
It also creates a risk of introducing a parallel Arrow-like ecosystem
based on the superset of data layouts understood by Arrow C++. People
may feel encouraged to code for that ecosystem, pessimizing
interoperability with non-C++ runtimes.
This is why I think those APIs, however convenient, also go against the
overarching goals of the Arrow project.
If we want to keep such convenience APIs as part of Arrow C++, they
should be clearly flagged as being non-Arrow compliant.
It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by
specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`).
But they could also be provided by a distinct library.
Regards
Antoine.
On 28/09/2023 at 09:01, Antoine Pitrou wrote:
Hi Ben,
On 27/09/2023 at 23:25, Benjamin Kietzman wrote:
@Antoine
What this PR is creating is an "unofficial" Arrow format, with data
types exposed in Arrow C++ that are not part of the Arrow standard, but
are exposed as if they were.
We already do this in every implementation of the arrow format I'm
aware of: it's more convenient to consider dictionary as a data type
even though the spec says that it is a field property.
I'm not sure I understand your point. Dictionary encoding is part of the
Arrow spec, and considering it as a data type is an API choice that does
not violate the spec.
Raw pointers in string views is just not an Arrow format.
Regards
Antoine.