> It would be reasonable to restrict JSON to utf8, and tell people they
> need to transcode in the rare cases where some obnoxious software
> outputs utf16-encoded JSON.

+1. I think this aligns with the latest JSON RFC [1] as well.

> Sounds good to me too. +1 on the canonical extension type option; maybe it
> should end up as a first-class type, but I'd like to see us try it without
> first and see what that tells us about the path for having an extension
> type get promoted to being a first-class type. This is something that has
> been discussed in principle before, but I don't know we've worked out what
> it would look like in practice.

From prior discussions, we agreed that it made sense to approach JSON as an
extension type [2]. As noted previously on the thread, I don't think this
precludes having APIs in C++/Python that make the type look the same as a
natively supported type, but there might be constraints we uncover as we
move forward with implementation.

I don't think we reached an exact conclusion on canonical extension types,
but [3] was the last conversation. I think the main question is whether
there are maintainers for other languages who want to add the extension
type; I can probably find some time for Java.

[1] https://datatracker.ietf.org/doc/html/rfc8259#section-8.1
[2] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t (sorry,
I still need to document the outcome of this discussion).
[3] https://lists.apache.org/thread/bd0ttt725jqn5ylsp8v006rpfymow3mn

On Sat, Jul 30, 2022 at 12:14 PM Antoine Pitrou <anto...@python.org> wrote:

>
> On 30/07/2022 at 01:02, Wes McKinney wrote:
> > I think either path:
> >
> > * Canonical extension type
> > * First-class type in the Type union in Flatbuffers
> >
> > would be OK. The canonical extension type option is the preferable
> > path here, I think, because it allows Arrow implementations without
> > any special handling for JSON to allow the data to pass through as
> > Binary or String. Implementations like C++ could see the extension
> > type metadata and construct an instance of arrow::Type::JSON /
> > JsonArray, etc., but when it gets serialized back to Parquet or Arrow
> > IPC it looks like binary/string (since JSON can be utf-16/utf-32,
> > right?) with additional field metadata.
>
> It would be reasonable to restrict JSON to utf8, and tell people they
> need to transcode in the rare cases where some obnoxious software
> outputs utf16-encoded JSON.
>
> And I agree a canonical extension type would be massively more useful
> for JSON than for UUID (which basically doesn't make sense: a UUID is an
> opaque binary string for all practical purposes).
>
> Regards
>
> Antoine.
>
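
For the rare UTF-16 case mentioned above, transcoding before the data reaches a utf8-only JSON type could be as small as the following. This is only a minimal Python sketch using the standard library, assuming the producer's encoding is known up front; `to_utf8_json` is a hypothetical helper name for illustration, not an Arrow API.

```python
import json


def to_utf8_json(raw: bytes, source_encoding: str = "utf-16") -> bytes:
    """Transcode JSON bytes from source_encoding to UTF-8.

    Hypothetical helper for illustration; not part of Arrow.
    """
    # The "utf-16" codec consumes a leading BOM, if present, to pick the byte order.
    text = raw.decode(source_encoding)
    # Parse once so that non-JSON input fails here rather than downstream.
    json.loads(text)
    # Re-encode the original text unchanged, so whitespace and key order survive.
    return text.encode("utf-8")


# Simulate a producer that emits UTF-16 JSON, then normalize it to UTF-8.
utf16_doc = '{"name": "café", "n": 1}'.encode("utf-16")
utf8_doc = to_utf8_json(utf16_doc)
```

The sketch decodes and re-encodes the text rather than re-serializing the parsed value, so the document content is preserved exactly and only the byte encoding changes.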