On Thu, 10 Feb 2022 at 14:22, Antoine Pitrou <anto...@python.org> wrote:
>
>
> Le 10/02/2022 à 14:09, Alessandro Molina a écrit :
> > Mentioned this already to Joris, but want to make sure we don't miss it.
> >
> > C-Data and thus ARROW:extension:metadata was mostly designed for shipping
> > data to different processes within the same host.
>
> ARROW:extension:metadata is unrelated to the C data interface.  It can
> be transmitted over IPC.

What Alessandro meant is that, *if* you use the binary format defined
by the C Data Interface (for the ArrowSchema::metadata) also in
general for encoding the value in a "ARROW:extension:metadata"
metadata key, then this value is not only used in the C Data Interface
itself but also for example for IPC. And at that point, as extension
type, you need to specify the endianness used for the formatting of
the serialized metadata.

>
>
>  > Json would definitely get rid of the endianess problem at the cost of a
>  > greater size and a more complex parser. But there are superminimal json
>  > parsers designed specifically for embedding like Jasmine (
>  > https://github.com/zserge/jsmn )
>
> Interesting. Jsmn is more of a lexer than a complete parser (for
> example, you need to decode numbers/booleans/... yourself, you need to
> unescape strings yourself). That said, for our use case it might be
> sufficient.
>
> Regards
>
> Antoine.
>
> >
> > On Wed, Feb 9, 2022 at 2:51 AM Dewey Dunnington <de...@voltrondata.com>
> > wrote:
> >
> >> I'll share a bit more about geospatial extension types that Joris
> >> mentioned. I'm new to the Arrow community and didn't know that there were
> >> any restrictions on metadata values (the C Data interface docs don't seem
> >> to indicate that there are restrictions, or if it's there I missed it!), so
> >> I used the same encoding for the ARROW:extension:metadata that's used to
> >> encode the parent metadata (int32 num_items, int32 name_len,
> >> char[name_len], int32 value_len, char[value_len],  etc..). I did this
> >> because I needed two key/value pairs (geodesic = true/false; crs =
> >> some_coordinate_reference_system) and already had the code to iterate over
> >> the parent metadata. I'm not saying that it's any pinnacle of elegant code
> >> (still very much a prototype), but it only takes about 30 lines of C to do
> >> this [1].
> >>
> >> I prototyped the extension types for geospatial using the C data interface,
> >> the idea being that a header-only helper file (geoarrow.hpp) could be
> >> distributed that would make it an attractive and easy alternative to
> >> well-known binary (WKB) to pass geometries around between libraries (e.g.,
> >> GEOS, GDAL, PROJ). Requiring anybody who uses an extension type to also
> >> vendor a JSON parser [2] seems a bit anti-social and restricts where that
> >> extension type is useful, although I understand that it's not the use case
> >> that many might have.
> >>
> >> There are definitely reasonable ways to do what I'm trying to do without
> >> resorting to a binary encoding, and JSON could probably even work...I'm
> >> just trying to share the use-case since it seems like this kind of
> >> environment isn't how folks envisioned extension types being used.
> >>
> >> [1]
> >>
> >> https://github.com/paleolimbot/geoarrow/blob/master/src/internal/geoarrow.hpp#L511-L542
> >> [2] The commonly vendored JSON parser in geospatial libraries is this one:
> >> https://github.com/nlohmann/json
> >>
> >> On Tue, Feb 8, 2022 at 7:58 PM Weston Pace <weston.p...@gmail.com> wrote:
> >>
> >>> I think I'm +0 but lean slightly towards JSON.
> >>>
> >>> In favor of binary I would guess that most extension types are going
> >>> to have relatively simple parameterization (to the point that
> >>> protobuf/flatbuffers isn't really needed).  For example, the substrate
> >>> consumer PR has five extension types at the moment (e.g. uuid,
> >>> varchar) and only two of them are parameterized and each of these by a
> >>> single int32_t.  It might be interesting to see what kinds of
> >>> extension types the geospatial community uses.
> >>>
> >>> That being said, this sort of parsing isn't really on any kind of
> >>> critical path.  It's very likely that users (not Arrow developers)
> >>> will be creating and working with extension types.  These users are
> >>> likely going to default to JSON (or pickle or XML).  If our "well
> >>> known types" use JSON then it will be more easily recognizable to
> >>> users what is going on.
> >>>
> >>> -Weston
> >>>
> >>> On Tue, Feb 8, 2022 at 8:14 AM Joris Van den Bossche
> >>> <jorisvandenboss...@gmail.com> wrote:
> >>>>
> >>>> On Tue, 8 Feb 2022 at 17:37, Jorge Cardoso Leitão <
> >>> jorgecarlei...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> ...
> >>>>>
> >>>>> Wrt to binary, imo the challenge is:
> >>>>> * we state that backward incompatible changes to the c data interface
> >>>>> require a new spec [1]
> >>>>>
> >>>>
> >>>> Note that this discussion wouldn't change anything about the C Data
> >>>> Interface spec itself. The discussion is only about the *value* that is
> >>> put
> >>>> in one of the key-value metadata fields. The C Data Interface spec
> >>> defines
> >>>> how the metadata needs to be stored, but doesn't specify anything about
> >>> the
> >>>> actual value of one of the key-value metadata fields.
> >>>>
> >>>>
> >>>>> * we state that the metadata is a binary string [2]
> >>>>> * a valid string is a subset of all valid byte arrays and thus
> >>> removing "
> >>>>> *string*" from the spec is backward incompatible
> >>>>>
> >>>>> If we write invalid utf8 to it and a reader assumes utf8 when reading
> >>> it,
> >>>>> we trigger undefined behavior.
> >>>>>
> >>>>> I was a bit surprised by ARROW-15613 - my understanding is that the
> >> c++
> >>>>> implementation is not following the spec, and if we at arrow2 were
> >> not
> >>> be
> >>>>> checking for utf8, we would be exposing a vulnerability (at least
> >>> according
> >>>>> to Rust's standards). We just checked it out of luck (it is O(1), so
> >>> why
> >>>>> not).
> >>>>>
> >>>>
> >>>> Yes, the C++ implementation is indeed not following the spec. See the
> >>>> "[DISCUSS] Binary Values in Key value pairs" thread (
> >>>> https://lists.apache.org/thread/blmj0cgv34dgdxqd3ow60ln68khnz0qr).
> >> Let's
> >>>> maybe keep this part of the discussion there?
> >>>
> >>
> >

Reply via email to