I think either path:

* Canonical extension type
* First-class type in the Type union in Flatbuffers

would be OK. I think the canonical extension type is the preferable
path here, because Arrow implementations without any special handling
for JSON can still pass the data through as Binary or String.
Implementations like C++ could see the extension type metadata and
construct an instance of arrow::Type::JSON / JsonArray, etc., but when
the data is serialized back to Parquet or Arrow IPC it looks like
binary/string (since JSON can be utf-16/utf-32, right?) with
additional field metadata.
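
To make the pass-through argument concrete, here is a rough sketch in
plain Python (standard library only, not the pyarrow API). The two
metadata keys are the ones the Arrow columnar format defines for
extension types. Everything else is an assumption made for
illustration: the Field class, the logical_type and round_trip
helpers, and the extension name "example.json" are all hypothetical.

```python
from dataclasses import dataclass, field

# Field-metadata keys the Arrow columnar format defines for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical extension name, chosen only for this illustration.
JSON_EXT_NAME = "example.json"


@dataclass
class Field:
    """Toy stand-in for an Arrow field: name, storage type, key/value metadata."""
    name: str
    storage_type: str  # e.g. "utf8" or "binary"
    metadata: dict = field(default_factory=dict)


def logical_type(f, known_extensions):
    """Return the type a reader would expose for this field.

    A reader that recognizes the extension name exposes the extension type;
    any other reader falls back to the storage type.
    """
    ext_name = f.metadata.get(EXT_NAME_KEY)
    if ext_name in known_extensions:
        return ext_name
    return f.storage_type


def round_trip(f):
    """Model serialization: storage type and metadata are copied unchanged."""
    return Field(f.name, f.storage_type, dict(f.metadata))


col = Field("payload", "utf8", {EXT_NAME_KEY: JSON_EXT_NAME, EXT_META_KEY: ""})

print(logical_type(col, {JSON_EXT_NAME}))  # reader with JSON support
print(logical_type(col, set()))            # reader without JSON support
print(round_trip(col) == col)              # metadata survives the round trip
```

The reader without JSON support never needs to know the extension
exists: it sees a string column plus two metadata entries that it
carries along untouched.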

On Fri, Jul 29, 2022 at 5:56 PM Pradeep Gollakota
<pgollak...@google.com.invalid> wrote:
>
> Thanks Micah!
>
> That's certainly one option we could use. It would likely be easier to
> implement at the outset. I wonder if something like arrow::json() would
> open up more options down the line.
>
> This brings up an interesting question of whether Parquet logical types
> should have a 1:1 mapping with Arrow logical types. Would we also want an
> arrow::bson()? I wouldn't think so. Maybe
> arrow::json({encoding=string/bson})? I'm not sure which would be better if
> we want to enable compute engines to manipulate the JSON data.
>
> On Fri, Jul 29, 2022 at 6:38 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Just to be clear, I think we are referring to a "well known"/canonical
> > extension type [1] here? I'd also be in favor of this. (Disclaimer: I'm a
> > colleague of Pradeep's.)
> >
> > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> >
> >
> > On Fri, Jul 29, 2022 at 3:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > This seems like a common-enough data type that having a first-class
> > > logical type would be a good idea (perhaps even more so than UUID!).
> > > Compute engines would be able to implement kernels that provide
> > > manipulations of JSON data similar to what you can do with jq or
> > > GraphQL.
> > >
> > > On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota
> > > <pgollak...@google.com.invalid> wrote:
> > > >
> > > > Hi Team!
> > > >
> > > > I filed ARROW-17255 to support the JSON logical type in Arrow.
> > > > Initially I'm only interested in C++ support that wraps a string. I
> > > > imagine that as Arrow and Parquet get more sophisticated, we might
> > > > want to do more interesting things (shredding?) with the JSON.
> > > >
> > > > David mentioned that there have been discussions around other "common"
> > > > extensions like UUID. Is this something that the community would be
> > > > interested in? My goal at the moment is to be able to export data from
> > > > BigQuery to Parquet with the correct LogicalType set in the exported
> > > > files.
> > > >
> > > > Thanks!
> > > > Pradeep
> > >
> >
>
>
> --
> Pradeep
