On Thu, 10 Feb 2022 at 14:22, Antoine Pitrou <anto...@python.org> wrote: > > > Le 10/02/2022 à 14:09, Alessandro Molina a écrit : > > Mentioned this already to Joris, but want to make sure we don't miss it. > > > > C-Data and thus ARROW:extension:metadata was mostly designed for shipping > > data to different processes within the same host. > > ARROW:extension:metadata is unrelated to the C data interface. It can > be transmitted over IPC.
What Alessandro meant is that, *if* you use the binary format defined by the C Data Interface (for the ArrowSchema::metadata) also in general for encoding the value in a "ARROW:extension:metadata" metadata key, then this value is not only used in the C Data Interface itself but also for example for IPC. And at that point, as extension type, you need to specify the endianness used for the formatting of the serialized metadata. > > > > Json would definitely get rid of the endianess problem at the cost of a > > greater size and a more complex parser. But there are superminimal json > > parsers designed specifically for embedding like Jasmine ( > > https://github.com/zserge/jsmn ) > > Interesting. Jsmn is more of a lexer than a complete parser (for > example, you need to decode numbers/booleans/... yourself, you need to > unescape strings yourself). That said, for our use case it might be > sufficient. > > Regards > > Antoine. > > > > > On Wed, Feb 9, 2022 at 2:51 AM Dewey Dunnington <de...@voltrondata.com> > > wrote: > > > >> I'll share a bit more about geospatial extension types that Joris > >> mentioned. I'm new to the Arrow community and didn't know that there were > >> any restrictions on metadata values (the C Data interface docs don't seem > >> to indicate that there are restrictions, or if it's there I missed it!), so > >> I used the same encoding for the ARROW:extension:metadata that's used to > >> encode the parent metadata (int32 num_items, int32 name_len, > >> char[name_len], int32 value_len, char[value_len], etc..). I did this > >> because I needed two key/value pairs (geodesic = true/false; crs = > >> some_coordinate_reference_system) and already had the code to iterate over > >> the parent metadata. I'm not saying that it's any pinnacle of elegant code > >> (still very much a prototype), but it only takes about 30 lines of C to do > >> this [1]. > >> > >> I prototyped the extension types for geospatial using the C data interface, > >> the idea being that a header-only helper file (geoarrow.hpp) could be > >> distributed that would make it an attractive and easy alternative to > >> well-known binary (WKB) to pass geometries around between libraries (e.g., > >> GEOS, GDAL, PROJ). Requiring anybody who uses an extension type to also > >> vendor a JSON parser [2] seems a bit anti-social and restricts where that > >> extension type is useful, although I understand that it's not the use case > >> that many might have. > >> > >> There are definitely reasonable ways to do what I'm trying to do without > >> resorting to a binary encoding, and JSON could probably even work...I'm > >> just trying to share the use-case since it seems like this kind of > >> environment isn't how folks envisioned extension types being used. > >> > >> [1] > >> > >> https://github.com/paleolimbot/geoarrow/blob/master/src/internal/geoarrow.hpp#L511-L542 > >> [2] The commonly vendored JSON parser in geospatial libraries is this one: > >> https://github.com/nlohmann/json > >> > >> On Tue, Feb 8, 2022 at 7:58 PM Weston Pace <weston.p...@gmail.com> wrote: > >> > >>> I think I'm +0 but lean slightly towards JSON. > >>> > >>> In favor of binary I would guess that most extension types are going > >>> to have relatively simple parameterization (to the point that > >>> protobuf/flatbuffers isn't really needed). For example, the substrate > >>> consumer PR has five extension types at the moment (e.g. uuid, > >>> varchar) and only two of them are parameterized and each of these by a > >>> single int32_t. It might be interesting to see what kinds of > >>> extension types the geospatial community uses. > >>> > >>> That being said, this sort of parsing isn't really on any kind of > >>> critical path. It's very likely that users (not Arrow developers) > >>> will be creating and working with extension types. These users are > >>> likely going to default to JSON (or pickle or XML). If our "well > >>> known types" use JSON then it will be more easily recognizable to > >>> users what is going on. > >>> > >>> -Weston > >>> > >>> On Tue, Feb 8, 2022 at 8:14 AM Joris Van den Bossche > >>> <jorisvandenboss...@gmail.com> wrote: > >>>> > >>>> On Tue, 8 Feb 2022 at 17:37, Jorge Cardoso Leitão < > >>> jorgecarlei...@gmail.com> > >>>> wrote: > >>>> > >>>>> ... > >>>>> > >>>>> Wrt to binary, imo the challenge is: > >>>>> * we state that backward incompatible changes to the c data interface > >>>>> require a new spec [1] > >>>>> > >>>> > >>>> Note that this discussion wouldn't change anything about the C Data > >>>> Interface spec itself. The discussion is only about the *value* that is > >>> put > >>>> in one of the key-value metadata fields. The C Data Interface spec > >>> defines > >>>> how the metadata needs to be stored, but doesn't specify anything about > >>> the > >>>> actual value of one of the key-value metadata fields. > >>>> > >>>> > >>>>> * we state that the metadata is a binary string [2] > >>>>> * a valid string is a subset of all valid byte arrays and thus > >>> removing " > >>>>> *string*" from the spec is backward incompatible > >>>>> > >>>>> If we write invalid utf8 to it and a reader assumes utf8 when reading > >>> it, > >>>>> we trigger undefined behavior. > >>>>> > >>>>> I was a bit surprised by ARROW-15613 - my understanding is that the > >> c++ > >>>>> implementation is not following the spec, and if we at arrow2 were > >> not > >>> be > >>>>> checking for utf8, we would be exposing a vulnerability (at least > >>> according > >>>>> to Rust's standards). We just checked it out of luck (it is O(1), so > >>> why > >>>>> not). > >>>>> > >>>> > >>>> Yes, the C++ implementation is indeed not following the spec. See the > >>>> "[DISCUSS] Binary Values in Key value pairs" thread ( > >>>> https://lists.apache.org/thread/blmj0cgv34dgdxqd3ow60ln68khnz0qr). > >> Let's > >>>> maybe keep this part of the discussion there? > >>> > >> > >