Re: [DISCUSS][C++] Raw pointer string views

Antoine Pitrou Thu, 28 Sep 2023 03:20:41 -0700

To make things clear, any of the factory functions listed below creates a type that maps exactly onto an Arrow columnar layout:

https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions

For example, calling `arrow::dictionary` creates a dictionary type that exactly represents the dictionary layout specified in https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout

Similarly, if you use any of the builders listed below, what you will get at the end is data that complies with the Arrow columnar specification:

https://arrow.apache.org/docs/dev/cpp/api/builder.html

All the core Arrow C++ APIs create and process data which complies with the Arrow specification, and which is interoperable with other Arrow implementations.

Conversely, non-Arrow data such as CSV or Parquet (or Python lists, etc.) goes through dedicated converters. There is no ambiguity.

Creating top-level utilities that create non-Arrow data introduces confusion and ambiguity as to what Arrow is. Users who haven't studied the spec in detail - which is probably most users of Arrow implementations - will call `arrow::string_view(raw_pointers=true)` and might later discover that their data cannot be shared with other implementations (or, if it can, there will be an unsuspected conversion cost at the edge).
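
To illustrate the conversion cost mentioned above, here is a minimal, self-contained sketch. It does not use the Arrow C++ API. `IndexOffsetView` mirrors the 16-byte view layout that the columnar spec describes for strings longer than 12 bytes (length, 4-byte prefix, buffer index, offset). `RawPointerView`, `Buffer` and `ToIndexOffset` are hypothetical names, and the raw-pointer field arrangement is an assumption made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Spec-style view of a long string: the bytes are located by
// (buffer_index, offset), which stays meaningful across processes.
struct IndexOffsetView {
  std::int32_t length;
  char prefix[4];
  std::int32_t buffer_index;
  std::int32_t offset;
};

// Hypothetical raw-pointer variant (assumed layout): the bytes are
// located by an address, which is only meaningful inside one process.
struct RawPointerView {
  std::int32_t length;
  char prefix[4];
  const char* data;
};

// Hypothetical description of one variadic data buffer.
struct Buffer {
  const char* base;
  std::int64_t size;
};

// Converts one raw-pointer view into an index/offset view by searching
// for the buffer that contains the pointed-to bytes. Returns false if no
// known buffer contains them. Short (inline) strings are not handled.
bool ToIndexOffset(const RawPointerView& raw,
                   const std::vector<Buffer>& buffers,
                   IndexOffsetView* out) {
  const auto begin = reinterpret_cast<std::uintptr_t>(raw.data);
  const auto end = begin + static_cast<std::uintptr_t>(raw.length);
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const auto base = reinterpret_cast<std::uintptr_t>(buffers[i].base);
    const auto limit = base + static_cast<std::uintptr_t>(buffers[i].size);
    if (begin >= base && end <= limit) {
      out->length = raw.length;
      std::memcpy(out->prefix, raw.prefix, sizeof(out->prefix));
      out->buffer_index = static_cast<std::int32_t>(i);
      out->offset = static_cast<std::int32_t>(begin - base);
      return true;
    }
  }
  return false;
}
```

In this sketch every element needs a buffer lookup before the data can cross an interoperability boundary, and a pointer into memory that is not one of the known buffers cannot be converted without copying the bytes first.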

It also creates a risk of introducing a parallel Arrow-like ecosystem based on the superset of data layouts understood by Arrow C++. People may feel encouraged to code for that ecosystem, pessimizing interoperability with non-C++ runtimes.

Which is why I think those APIs, however convenient, also go against the overarching goals of the Arrow project.

If we want to keep such convenience APIs as part of Arrow C++, they should be clearly flagged as being non-Arrow compliant.

It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`).
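
A rough sketch of what the namespacing option could look like follows. It is self-contained and does not use the real Arrow headers: `sketch::DataType` is a hypothetical stand-in for `arrow::DataType`, and `arrow_compliant()` is an invented method used only to make the distinction visible. Only the name `non_arrow::raw_pointers_string_view()` comes from the suggestion above.

```cpp
#include <cassert>
#include <memory>
#include <string>

namespace sketch {

// Hypothetical stand-in for arrow::DataType.
struct DataType {
  virtual ~DataType() = default;
  virtual std::string name() const = 0;
  // Invented for this sketch: true if the type maps onto a spec layout.
  virtual bool arrow_compliant() const { return true; }
};

// Stand-in for the spec-compliant string view type.
struct StringViewType : DataType {
  std::string name() const override { return "string_view"; }
};

// Factory for the compliant type, in the "main" namespace.
inline std::shared_ptr<DataType> string_view() {
  return std::make_shared<StringViewType>();
}

}  // namespace sketch

namespace non_arrow {

// The raw-pointer variant lives in a separate namespace, so that every
// call site spells out that the result is not an Arrow layout.
struct RawPointersStringViewType : sketch::DataType {
  std::string name() const override { return "raw_pointers_string_view"; }
  bool arrow_compliant() const override { return false; }
};

inline std::shared_ptr<sketch::DataType> raw_pointers_string_view() {
  return std::make_shared<RawPointersStringViewType>();
}

}  // namespace non_arrow
```

With this arrangement a user has to write `non_arrow::` to obtain the raw-pointer type, and a boundary such as an export routine could reject any type for which `arrow_compliant()` is false.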


But they could also be provided by a distinct library.

Regards

Antoine.



On 28/09/2023 at 09:01, Antoine Pitrou wrote:


Hi Ben,

On 27/09/2023 at 23:25, Benjamin Kietzman wrote:


@Antoine

What this PR is creating is an "unofficial" Arrow format, with data
types exposed in Arrow C++ that are not part of the Arrow standard, but
are exposed as if they were.

We already do this in every implementation of the arrow format I'm
aware of: it's more convenient to consider dictionary as a data type
even though the spec says that it is a field property.


I'm not sure I understand your point. Dictionary encoding is part of the
Arrow spec, and considering it as a data type is an API choice that does
not violate the spec.

Raw pointers in string views are just not an Arrow format.

Regards

Antoine.
