Sounds good to me too. +1 on the canonical extension type option; maybe it
should end up as a first-class type, but I'd like to see us try it without
first and see what that tells us about the path for having an extension
type get promoted to being a first-class type. This is something that has
been discussed in principle before, but I don't know that we've worked out what
it would look like in practice.

I spoke with someone at the RStudio conference this week who requested this
type as well. Relatedly, there is a gap in the C++ library: we don't have
compute functions for JSON parsing and serializing. That logic exists only
in the JSON file reader (and in test utilities etc.). So if you get data
that has a column of JSON strings, you can't do anything with it (unless
both my memory and grep fail me).

Neal
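
For readers following the thread, the pass-through behavior that the quoted
message below attributes to extension types can be sketched in plain Python.
This is only an illustration under stated assumptions, not Arrow's actual
API: the `Field` class, the two reader functions, and the extension name
"example.json" are hypothetical. Only the two metadata keys come from the
Arrow columnar format specification.

```python
import json
from dataclasses import dataclass, field

# Metadata keys defined by the Arrow columnar format for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical extension name, chosen only for this sketch.
JSON_EXT_NAME = "example.json"


@dataclass
class Field:
    """Toy stand-in for a schema field: a storage type plus key/value metadata."""
    name: str
    storage_type: str  # e.g. "string" or "binary"
    metadata: dict = field(default_factory=dict)


def json_field(name: str) -> Field:
    """Build a string field tagged with the hypothetical JSON extension name."""
    return Field(name, "string", {EXT_NAME_KEY: JSON_EXT_NAME, EXT_META_KEY: ""})


def read_plain(fld: Field, values: list) -> list:
    """Reader with no JSON support: ignores the extension keys entirely."""
    return list(values)


def read_json_aware(fld: Field, values: list) -> list:
    """Reader that recognizes the extension name and parses each non-null value."""
    if fld.metadata.get(EXT_NAME_KEY) == JSON_EXT_NAME:
        return [None if v is None else json.loads(v) for v in values]
    return list(values)


fld = json_field("payload")
column = ['{"a": 1}', None, "[1, 2, 3]"]

plain = read_plain(fld, column)        # storage values, untouched
parsed = read_json_aware(fld, column)  # parsed Python objects
```

The point of the design is that the storage type never changes: a reader
that does not know the extension name still sees an ordinary string column,
while one that does can opt in to JSON-specific handling.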

On Fri, Jul 29, 2022 at 7:03 PM Wes McKinney <wesmck...@gmail.com> wrote:

> I think either path:
>
> * Canonical extension type
> * First-class type in the Type union in Flatbuffers
>
> would be OK. The canonical extension type option is the preferable
> path here, I think, because it allows Arrow implementations without
> any special handling for JSON to allow the data to pass through as
> Binary or String. Implementations like C++ could see the extension
> type metadata and construct an instance of arrow::Type::JSON /
> JsonArray, etc., but when it gets serialized back to Parquet or Arrow
> IPC it looks like binary/string (since JSON can be utf-16/utf-32,
> right?) with additional field metadata.
>
> On Fri, Jul 29, 2022 at 5:56 PM Pradeep Gollakota
> <pgollak...@google.com.invalid> wrote:
> >
> > Thanks Micah!
> >
> > That's certainly one option we could use. It would likely be easier to
> > implement at the outset. I wonder if something like arrow::json() would
> > open up more options down the line.
> >
> > This brings up an interesting question of whether Parquet logical types
> > should have a 1:1 mapping with Arrow logical types. Would we also want an
> > arrow::bson()? I wouldn't think so. Maybe
> > arrow::json({encoding=string/bson})? I'm not sure which would be better if
> > we want to enable compute engines to manipulate the JSON data.
> >
> > On Fri, Jul 29, 2022 at 6:38 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Just to be clear, I think we are referring to a "well known"/canonical
> > > extension type [1] here? I'd also be in favor of this. (Disclaimer: I'm a
> > > colleague of Pradeep's.)
> > >
> > > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> > >
> > >
> > > On Fri, Jul 29, 2022 at 3:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > This seems like a common-enough data type that having a first-class
> > > > logical type would be a good idea (perhaps even more so than UUID!).
> > > > Compute engines would be able to implement kernels that provide
> > > > manipulations of JSON data similar to what you can do with jq or
> > > > GraphQL.
> > > >
> > > > On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota
> > > > <pgollak...@google.com.invalid> wrote:
> > > > >
> > > > > Hi Team!
> > > > >
> > > > > I filed ARROW-17255 to support the JSON logical type in Arrow. Initially
> > > > > I'm only interested in C++ support that wraps a string. I imagine that as
> > > > > Arrow and Parquet get more sophisticated, we might want to do more
> > > > > interesting things (shredding?) with the JSON.
> > > > >
> > > > > David mentioned that there have been discussions around other "common"
> > > > > extensions like UUID. Is this something that the community would be
> > > > > interested in? My goal at the moment is to be able to export data from
> > > > > BigQuery to Parquet with the correct LogicalType set in the exported
> > > > > files.
> > > > >
> > > > > Thanks!
> > > > > Pradeep
> > > >
> > >
> >
> >
> > --
> > Pradeep
>
