I think either path:

* Canonical extension type
* First-class type in the Type union in Flatbuffers

would be OK. The canonical extension type option is the preferable
path here, I think, because it allows Arrow implementations without
any special handling for JSON to allow the data to pass through as
Binary or String. Implementations like C++ could see the extension
type metadata and construct an instance of arrow::Type::JSON /
JsonArray, etc., but when it gets serialized back to Parquet or Arrow
IPC it looks like binary/string (since JSON can be utf-16/utf-32,
right?) with additional field metadata.

On Fri, Jul 29, 2022 at 5:56 PM Pradeep Gollakota
<pgollak...@google.com.invalid> wrote:
>
> Thanks Micah!
>
> That's certainly one option we could use. It would likely be easier to
> implement at the outset. I wonder if something like arrow::json() would
> open up more options down the line.
>
> This brings up an interesting question of whether Parquet logical types
> should have a 1:1 mapping with Arrow logical types. Would we also want an
> arrow::bson()? I wouldn't think so. Maybe
> arrow::json({encoding=string/bson})? I'm not sure which would be better if
> we want to enable compute engines to manipulate the JSON data.
>
> On Fri, Jul 29, 2022 at 6:38 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Just to be clear, I think we are referring to a "well known"/canonical
> > extension type [1] here? I'd also be in favor of this (Disclaimer: I'm a
> > colleague of Pradeep's)
> >
> > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> >
> >
> > On Fri, Jul 29, 2022 at 3:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > This seems like a common-enough data type that having a first-class
> > > logical type would be a good idea (perhaps even more so than UUID!).
> > > Compute engines would be able to implement kernels that provide
> > > manipulations of JSON data similar to what you can do with jq or
> > > GraphQL.
> > >
> > > On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota
> > > <pgollak...@google.com.invalid> wrote:
> > >
> > > > Hi Team!
> > > >
> > > > I filed ARROW-17255 to support the JSON logical type in Arrow.
> > > > Initially I'm only interested in C++ support that wraps a string.
> > > > I imagine that as Arrow and Parquet get more sophisticated, we
> > > > might want to do more interesting things (shredding?) with the JSON.
> > > >
> > > > David mentioned that there have been discussions around other "common"
> > > > extensions like UUID. Is this something that the community would be
> > > > interested in? My goal at the moment is to be able to export data from
> > > > BigQuery to Parquet with the correct LogicalType set in the exported
> > > > files.
> > > >
> > > > Thanks!
> > > > Pradeep
> > > >
> > >
> --
> Pradeep
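
For anyone following along, here is a rough sketch of the pass-through
behaviour that the extension type option in the reply at the top of this
thread relies on. It is a toy model in plain Python, not the Arrow API:
the Field class, json_field(), and the two reader functions are
hypothetical, and the extension name "arrow.json" is an assumption made
for illustration. Only the two ARROW:extension:* metadata keys come from
the columnar format documentation.

```python
import json
from dataclasses import dataclass, field

# Field metadata keys that the Arrow columnar format reserves for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Assumed extension name, used only for this sketch.
JSON_EXT_NAME = "arrow.json"


@dataclass
class Field:
    """Toy stand-in for an Arrow field: name, storage type, custom metadata."""
    name: str
    storage_type: str  # e.g. "utf8" or "binary"
    metadata: dict = field(default_factory=dict)


def json_field(name: str) -> Field:
    """Build a utf8 field tagged with the assumed JSON extension name."""
    return Field(name, "utf8", {EXT_NAME_KEY: JSON_EXT_NAME, EXT_META_KEY: ""})


def read_generic(fld: Field, values: list) -> list:
    """Reader with no JSON handling: ignores the tag and returns the storage values."""
    return list(values)


def read_json_aware(fld: Field, values: list) -> list:
    """Reader that recognizes the tag and parses each non-null value as JSON."""
    if fld.metadata.get(EXT_NAME_KEY) != JSON_EXT_NAME:
        return list(values)
    return [None if v is None else json.loads(v) for v in values]


fld = json_field("payload")
column = ['{"a": 1}', None, "[1, 2, 3]"]

raw = read_generic(fld, column)        # plain strings, passed through untouched
parsed = read_json_aware(fld, column)  # Python dicts/lists, nulls preserved
```

The point of the sketch is that both readers see the same storage
(strings plus field metadata); only the one that knows the extension
name does anything JSON-specific with it.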