> > 2. What do we do about different non-utf8 encodings? There does not appear
> > to be a consensus yet on this point. One option is to only allow utf8
> > encoding and force implementers to convert non-utf8 to utf8. Second option
> > is to allow all encodings and capture the encoding in the metadata (I'm
> > leaning towards this option).
>
> Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> them only adds complexity for the tiny minority of producers of non-utf8
> JSON.

I'd also add that if we only allow the extension on utf8 today, it would be a
forward/backward compatible change to allow parameterizing the extension for
the bytes type by encoding if we wanted to support it in the future.

Parquet also only supports UTF-8 [1] for its logical JSON type.

[1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json

On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <anto...@python.org> wrote:

>
> On 01/08/2022 at 22:53, Pradeep Gollakota wrote:
> > Thanks for all the great feedback.
> >
> > To proceed forward, we seem to need decisions around the following:
> >
> > 1. Whether to use arrow extensions or first class types. The consensus is
> > building towards using arrow extensions.
>
> +1
>
> > 2. What do we do about different non-utf8 encodings? There does not appear
> > to be a consensus yet on this point. One option is to only allow utf8
> > encoding and force implementers to convert non-utf8 to utf8. Second option
> > is to allow all encodings and capture the encoding in the metadata (I'm
> > leaning towards this option).
>
> Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> them only adds complexity for the tiny minority of producers of non-utf8
> JSON.
>
> > 3. What do we do about the different formats of JSON (string, BSON, UBJSON,
> > etc.)?
>
> There are no "different formats of JSON". BSON etc. are unrelated formats.
>
> Regards
>
> Antoine.
>
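
To make the compatibility point about parameterizing by encoding a bit more
concrete, here is a rough sketch. It is not the Arrow API: the function names
and the metadata layout (a small JSON object with an "encoding" key) are
assumptions made up for illustration. The idea is that a UTF-8-only extension
serializes to empty metadata today, so a later parameterized version can treat
"no metadata" as UTF-8 without changing anything for existing data.

```python
import json

# Hypothetical default: the only encoding allowed in the initial version.
DEFAULT_ENCODING = "utf8"


def serialize_json_ext_metadata(encoding=DEFAULT_ENCODING):
    """Serialize extension metadata (hypothetical layout).

    UTF-8 produces empty metadata, exactly what a UTF-8-only
    version of the extension would have written.
    """
    if encoding == DEFAULT_ENCODING:
        return b""
    return json.dumps({"encoding": encoding}).encode("ascii")


def deserialize_json_ext_metadata(serialized):
    """Recover the encoding; empty or missing metadata means UTF-8."""
    if not serialized:
        return DEFAULT_ENCODING
    params = json.loads(serialized.decode("ascii"))
    return params.get("encoding", DEFAULT_ENCODING)


# Data written before the parameter existed still reads as UTF-8.
old_metadata = b""
print(deserialize_json_ext_metadata(old_metadata))

# A future producer could opt in to another encoding explicitly.
new_metadata = serialize_json_ext_metadata("utf16")
print(deserialize_json_ext_metadata(new_metadata))
```

Under these assumptions, UTF-8 data round-trips identically before and after
the parameter is introduced; only producers of non-UTF-8 JSON would ever emit
non-empty metadata.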