To be clear, each of the factory functions listed below creates a type that maps exactly onto an Arrow columnar layout:
https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions

For example, calling `arrow::dictionary` creates a dictionary type that exactly represents the dictionary layout specified in https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout
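To make the layout itself concrete, here is a minimal standard-library sketch of dictionary encoding: each distinct value is stored once, and the column becomes a sequence of integer indices into that dictionary. `DictionaryEncoded` and `DictionaryEncode` are hypothetical names used only for illustration; they are not Arrow C++ types, and validity bitmaps are left out.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded column (not an Arrow C++ type).
struct DictionaryEncoded {
  std::vector<std::int32_t> indices;    // one index per logical value
  std::vector<std::string> dictionary;  // each distinct value stored once
};

// Encode a plain column: the first occurrence of a value gets the next free
// index, and later occurrences reuse it.
inline DictionaryEncoded DictionaryEncode(const std::vector<std::string>& values) {
  DictionaryEncoded out;
  std::unordered_map<std::string, std::int32_t> seen;
  for (const auto& v : values) {
    auto it = seen.find(v);
    if (it == seen.end()) {
      it = seen.emplace(v, static_cast<std::int32_t>(out.dictionary.size())).first;
      out.dictionary.push_back(v);
    }
    out.indices.push_back(it->second);
  }
  return out;
}
```

In Arrow C++ proper, the type produced by `arrow::dictionary` describes this same indices-plus-dictionary arrangement.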

Similarly, if you use any of the builders listed below, what you will get at the end is data that complies with the Arrow columnar specification:
https://arrow.apache.org/docs/dev/cpp/api/builder.html

All the core Arrow C++ APIs create and process data which complies with the Arrow specification, and which is interoperable with other Arrow implementations.

Conversely, non-Arrow data such as CSV or Parquet (or Python lists, etc.) goes through dedicated converters. There is no ambiguity.


Creating top-level utilities that create non-Arrow data introduces confusion and ambiguity as to what Arrow is. Users who haven't studied the spec in detail - which is probably most users of Arrow implementations - will call `arrow::string_view(raw_pointers=true)` and might later discover that their data cannot be shared with other implementations (or, if it can, there will be an unsuspected conversion cost at the edge).

It also creates a risk of introducing a parallel Arrow-like ecosystem based on the superset of data layouts understood by Arrow C++. People may feel encouraged to code for that ecosystem, pessimizing interoperability with non-C++ runtimes.

Which is why I think those APIs, however convenient, also go against the overarching goals of the Arrow project.
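To illustrate the conversion cost at the edge mentioned above: a raw pointer only means something inside the process that created it, so before such data can be handed to another implementation, every view has to be rewritten in a position-independent form. The sketch below assumes a deliberately simplified target (a length plus an offset into one shared character buffer), which is not the exact layout of any Arrow format. `RawPointerView`, `OffsetView`, `OffsetViewColumn` and `ToOffsetViews` are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-process view: points directly at bytes in memory.
struct RawPointerView {
  const char* data;
  std::int32_t length;
};

// Hypothetical position-independent view: refers to bytes by offset.
struct OffsetView {
  std::int32_t length;
  std::int32_t offset;  // start of the bytes inside OffsetViewColumn::data
};

struct OffsetViewColumn {
  std::string data;               // single buffer holding all character bytes
  std::vector<OffsetView> views;  // one view per logical value
};

// The "conversion at the edge": every element is visited and its bytes are
// copied into the shared buffer, so the cost is linear in the total data size.
inline OffsetViewColumn ToOffsetViews(const std::vector<RawPointerView>& raw) {
  OffsetViewColumn out;
  for (const RawPointerView& v : raw) {
    out.views.push_back(
        OffsetView{v.length, static_cast<std::int32_t>(out.data.size())});
    out.data.append(v.data, static_cast<std::size_t>(v.length));
  }
  return out;
}
```

A user who built the raw-pointer form through a top-level API would pay for this pass without having asked for it.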


If we want to keep such convenience APIs as part of Arrow C++, they should be clearly flagged as being non-Arrow compliant.

This could be done by naming (e.g. `arrow::non_arrow_string_view()`) or by specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`).

But they could also be provided by a distinct library.
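As a rough sketch of what the naming and namespacing options might look like at the declaration level (everything here is hypothetical: `TypeDescriptor` stands in for a real data type class, and `arrow_sketch` stands in for the `arrow` namespace so that the snippet compiles on its own):

```cpp
#include <string>

// Hypothetical stand-in for a data type descriptor; not arrow::DataType.
struct TypeDescriptor {
  std::string name;
  bool arrow_compliant;  // true if the layout is part of the Arrow spec
};

namespace arrow_sketch {  // stands in for the real arrow:: namespace

// Spec-compliant factory: safe to share with other implementations.
inline TypeDescriptor string_view() { return {"string_view", true}; }

// Option 1: flag the non-compliant layout in the function name.
inline TypeDescriptor non_arrow_string_view() {
  return {"string_view[raw_pointers]", false};
}

}  // namespace arrow_sketch

namespace non_arrow {

// Option 2: flag the non-compliant layout through a dedicated namespace.
inline TypeDescriptor raw_pointers_string_view() {
  return {"string_view[raw_pointers]", false};
}

}  // namespace non_arrow
```

Either way, the call site itself tells the reader that the result is not Arrow data.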

Regards

Antoine.



On 28/09/2023 at 09:01, Antoine Pitrou wrote:

Hi Ben,

On 27/09/2023 at 23:25, Benjamin Kietzman wrote:

@Antoine
What this PR is creating is an "unofficial" Arrow format, with data
types exposed in Arrow C++ that are not part of the Arrow standard, but
are exposed as if they were.

We already do this in every implementation of the arrow format I'm
aware of: it's more convenient to consider dictionary as a data type
even though the spec says that it is a field property.

I'm not sure I understand your point. Dictionary encoding is part of the
Arrow spec, and considering it as a data type is an API choice that does
not violate the spec.

Raw pointers in string views is just not an Arrow format.

Regards

Antoine.
