> From: "Brian Goetz" <brian.go...@oracle.com>
> To: "Ethan McCue" <et...@mccue.dev>, "core-libs-dev" <core-libs-...@openjdk.java.net>
> Sent: Tuesday, February 28, 2023 8:48:00 PM
> Subject: Re: JEP-198 - Lets start talking about JSON
> As you can probably imagine, I've been thinking about these topics for quite a
> while, ever since we started working on records and pattern matching. It sounds
> like a lot of your thoughts have followed a similar arc to ours.
> I'll share with you some of our thoughts, but I can't be engaging in a detailed
> back-and-forth right now -- we have too many other things going on, and this
> isn't yet on the front burner. I think there's a right time for this work, and
> we're not quite there yet, but we'll get there soon enough and we'll pick up
> the ball again then.
> To the existential question: yes, there should be a simpler, built-in way to
> parse JSON. And, as you observe, the railroad diagram in the JSON spec is a
> graphical description of an algebraic data type. One of the great simplifying
> effects of having algebraic data types (records + sealed classes) in the
> language is that many data modeling problems collapse down to the point where
> considerably less creativity is required of an API. Here's the JSON API one can
> write after literally only 30 seconds of thought:
>> sealed interface JsonValue {
>>     record JsonString(String s) implements JsonValue { }
>>     record JsonNumber(double d) implements JsonValue { }
>>     record JsonNull() implements JsonValue { }
>>     record JsonBoolean(boolean b) implements JsonValue { }
>>     record JsonArray(List<JsonValue> values) implements JsonValue { }
>>     record JsonObject(Map<String, JsonValue> pairs) implements JsonValue { }
>> }
> It matches the JSON spec almost literally, and you can use pattern matching to
> parse a document. (OK, there's some tiny bit of creativity here in that
> True/False have been collapsed to a single JsonBoolean type, but you get my
> point.)
> But, we're not quite ready to put this API into the JDK, because the language
> isn't *quite* there yet.
> Records give you nice pattern matching, but they come at a cost; they're very
> specific and have rigid ideas about initialization, which ripples into a
> number of constraints on an implementation (i.e., much harder to parse
> lazily.) So we're waiting until we have deconstruction patterns (next up on
> the patterns parade) so that the records above can be interfaces and still
> support pattern matching (and more flexibility in implementation, including
> using value classes when they arrive.) It's not a long hop, though.
> I agree with your assessment of streaming models; for documents too large to
> fit into memory, we'll let someone else provide a specialized solution.
> Streaming and fully-materialized-tree are not the only two options; there are
> plenty of points in the middle.
> As to API idioms, these can be layered. The lazy-tree model outlined above can
> be a foundation for data binding, dynamic mapping to records, jsonpath, etc.
> But once you've made the streaming-vs-materialized choice in favor of
> materialized, it's hard to imagine not having something like the above at the
> base of the tower.
> The question you raise about error handling is one that infuses pattern
> matching in general. Pattern matching allows us to collapse what would be a
> thousand questions -- "does key X exist? is it mapped to a number? is the
> number in the range of byte?" -- each with their own failure-handling path,
> into a single question. That's great for reliable and readable code, but it
> does make errors more opaque, because it is more like the red "check engine"
> light on your dashboard. (Something like JSONPath could generate better error
> messages since you've given it a declarative description of an assumed
> structural invariant.) But, imperative code that has to treat each structural
> assumption as a possible control-flow point is a disaster; we've seen too much
> code like this already.
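[As an editorial illustration, not from the thread: here is roughly what consuming the JsonValue tree sketched above looks like with Java 21 record patterns. The JsonValue shape follows Brian's sketch; the render method and class name are invented for the demo.]

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class JsonPatterns {
    // The ADT from Brian's sketch, nested here for self-containment.
    sealed interface JsonValue {
        record JsonString(String s) implements JsonValue { }
        record JsonNumber(double d) implements JsonValue { }
        record JsonNull() implements JsonValue { }
        record JsonBoolean(boolean b) implements JsonValue { }
        record JsonArray(List<JsonValue> values) implements JsonValue { }
        record JsonObject(Map<String, JsonValue> pairs) implements JsonValue { }
    }

    // An exhaustive switch over the sealed hierarchy: no default branch is
    // needed, and the compiler complains if a new case is added to JsonValue.
    static String render(JsonValue v) {
        return switch (v) {
            case JsonValue.JsonNull() -> "null";
            case JsonValue.JsonBoolean(boolean b) -> Boolean.toString(b);
            case JsonValue.JsonNumber(double d) -> Double.toString(d);
            case JsonValue.JsonString(String s) -> "\"" + s + "\"";
            case JsonValue.JsonArray(List<JsonValue> values) ->
                values.stream().map(JsonPatterns::render)
                      .collect(Collectors.joining(",", "[", "]"));
            case JsonValue.JsonObject(Map<String, JsonValue> pairs) ->
                pairs.entrySet().stream()
                     .map(e -> "\"" + e.getKey() + "\":" + render(e.getValue()))
                     .collect(Collectors.joining(",", "{", "}"));
        };
    }

    public static void main(String[] args) {
        JsonValue doc = new JsonValue.JsonArray(List.of(
            new JsonValue.JsonNumber(1),
            new JsonValue.JsonBoolean(true),
            new JsonValue.JsonNull()));
        System.out.println(render(doc)); // [1.0,true,null]
    }
}
```

This is the "collapse a thousand questions into one" property Brian describes: each case simultaneously tests the shape and binds the components.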
> The ecosystem is big enough that there will be lots of people with strong
> opinions that "X is the only sensible way to do it" (we've already seen
> X=databinding on this thread), but the reality is that there are multiple
> overlapping audiences here, and we have to be clear which audiences we are
> prioritizing. We can have that debate when the time is right.
> So, we'll get there, but we're waiting for one or two more bits of language
> evolution to give us the substrate for the API that feels right.
> Hope this helps,
> -Brian

You can "simulate" deconstructors by using when + instanceof.
Let's say we have an interface with a deconstructor that can deconstruct an
instance of that interface into a tuple (x, y):

  interface Point {
    record $(int x, int y) {}
    $ deconstructor();
  }

If there is an implementation, the deconstructor is just an implementation of
the instance method "deconstructor":

  class PointImpl implements Point {
    private final int x;
    private final int y;

    public PointImpl(int x, int y) {
      this.x = x;
      this.y = y;
    }

    @Override
    public $ deconstructor() {
      return new $(x, y);
    }
  }

Then inside a switch, "case Point(int x, int y)" can be translated to
"case Point p when p.deconstructor() instanceof Point.$(int x, int y)", like
this:

  public static void main(String[] args) {
    Point point = new PointImpl(3, 4);
    var value = switch (point) {
      case Point p when p.deconstructor() instanceof Point.$(int x, int y) -> x + y;
      default -> throw new MatchException("oops", null);
    };
    System.out.println(value);
  }

Rémi

> On 12/15/2022 3:30 PM, Ethan McCue wrote:
>> I'm writing this to drive some forward motion and to nerd-snipe those who
>> know better than I do into putting their thoughts into words.
>> There are three ways to process JSON[1]:
>> - Streaming (Push or Pull)
>> - Traversing a Tree (Realized or Lazy)
>> - Declarative Databind (N ways)
>> Of these, JEP-198 explicitly ruled out providing "JAXB style type safe data
>> binding."
>> No justification is given, but if I had to insert my own: mapping the Json
>> model to/from the Java/JVM object model is a cursed combo of
>> - Huge possible design space
>> - Unpalatably large surface for backwards compatibility
>> - Serialization! Boo![2]
>> So for an artifact like the JDK, it probably doesn't make sense to include.
>> That tracks.
>> It won't make everyone happy, people like databind APIs, but it tracks.
>> So for the "read flow" these are the things to figure out.
>>                 | Should Provide? | Intended User(s) |
>> ----------------+-----------------+------------------+
>> Streaming Push  |                 |                  |
>> ----------------+-----------------+------------------+
>> Streaming Pull  |                 |                  |
>> ----------------+-----------------+------------------+
>> Realized Tree   |                 |                  |
>> ----------------+-----------------+------------------+
>> Lazy Tree       |                 |                  |
>> ----------------+-----------------+------------------+
>> At which point, we should talk about what "meets needs of Java developers
>> using JSON" implies.
>> JSON is ubiquitous. Most kinds of software us schmucks write could have a
>> reason to interact with it.
>> The full set of "user personas" therefore isn't practical for me to talk
>> about.[3]
>> JSON documents, however, are not so varied.
>> - There are small ones (1-10kb)
>> - There are medium ones (10-1000kb)
>> - There are big ones (1000kb-???)
>> - There are shallow ones
>> - There are deep ones
>> So that feels like an easier direction to talk about it from.
>> This repo[4] has some convenient toy examples of how some of those APIs look
>> in libraries in the ecosystem. Specifically the Streaming Pull and Realized
>> Tree models.
>> User r = new User();
>> while (true) {
>>     JsonToken token = reader.peek();
>>     switch (token) {
>>         case BEGIN_OBJECT:
>>             reader.beginObject();
>>             break;
>>         case END_OBJECT:
>>             reader.endObject();
>>             return r;
>>         case NAME:
>>             String fieldname = reader.nextName();
>>             switch (fieldname) {
>>                 case "id":
>>                     r.setId(reader.nextString());
>>                     break;
>>                 case "index":
>>                     r.setIndex(reader.nextInt());
>>                     break;
>>                 ...
>>                 case "friends":
>>                     r.setFriends(new ArrayList<>());
>>                     Friend f = null;
>>                     carryOn = true;
>>                     while (carryOn) {
>>                         token = reader.peek();
>>                         switch (token) {
>>                             case BEGIN_ARRAY:
>>                                 reader.beginArray();
>>                                 break;
>>                             case END_ARRAY:
>>                                 reader.endArray();
>>                                 carryOn = false;
>>                                 break;
>>                             case BEGIN_OBJECT:
>>                                 reader.beginObject();
>>                                 f = new Friend();
>>                                 break;
>>                             case END_OBJECT:
>>                                 reader.endObject();
>>                                 r.getFriends().add(f);
>>                                 break;
>>                             case NAME:
>>                                 String fn = reader.nextName();
>>                                 switch (fn) {
>>                                     case "id":
>>                                         f.setId(reader.nextString());
>>                                         break;
>>                                     case "name":
>>                                         f.setName(reader.nextString());
>>                                         break;
>>                                 }
>>                                 break;
>>                         }
>>                     }
>>                     break;
>>             }
>>     }
>> }
>> I think it's not hard to argue that the streaming APIs are brutalist. The
>> above is Gson, but Jackson, moshi, etc. seem at least morally equivalent.
>> It's hard to write, hard to write *correctly*, and there is a curious
>> propensity towards pairing it with anemic, mutable models.
>> That being said, it handles big documents and deep documents really well. It
>> also performs pretty darn well and is good enough as a "fallback" when the
>> intended user experience is through something like databind.
>> So what could we do meaningfully better with the language we have today/will
>> have tomorrow?
>> - Sealed interfaces + Pattern matching could give a nicer model for tokens
>> sealed interface JsonToken {
>>     record Field(String name) implements JsonToken {}
>>     record BeginArray() implements JsonToken {}
>>     record EndArray() implements JsonToken {}
>>     record BeginObject() implements JsonToken {}
>>     record EndObject() implements JsonToken {}
>>     // ...
>> }
>> // ...
>> User r = new User();
>> while (true) {
>>     JsonToken token = reader.peek();
>>     switch (token) {
>>         case BeginObject __:
>>             reader.beginObject();
>>             break;
>>         case EndObject __:
>>             reader.endObject();
>>             return r;
>>         case Field("id"):
>>             r.setId(reader.nextString());
>>             break;
>>         case Field("index"):
>>             r.setIndex(reader.nextInt());
>>             break;
>>         // ...
>>         case Field("friends"):
>>             r.setFriends(new ArrayList<>());
>>             Friend f = null;
>>             carryOn = true;
>>             while (carryOn) {
>>                 token = reader.peek();
>>                 switch (token) {
>>                     // ...
>> - Value classes can make it all more efficient
>> sealed interface JsonToken {
>>     value record Field(String name) implements JsonToken {}
>>     value record BeginArray() implements JsonToken {}
>>     value record EndArray() implements JsonToken {}
>>     value record BeginObject() implements JsonToken {}
>>     value record EndObject() implements JsonToken {}
>>     // ...
>> }
>> - (Fun One) We can transform a simpler-to-write push parser into a pull
>> parser with Coroutines
>> This is just a toy we could play with while making something in the JDK. I'm
>> pretty sure we could make a parser which feeds into something like
>> interface Listener {
>>     void onObjectStart();
>>     void onObjectEnd();
>>     void onArrayStart();
>>     void onArrayEnd();
>>     void onField(String name);
>>     // ...
>> }
>> and invert a loop like
>> while (true) {
>>     char c = next();
>>     switch (c) {
>>         case '{':
>>             listener.onObjectStart();
>>             // ...
>>         // ...
>>     }
>> }
>> by putting a Coroutine.yield in the callback.
>> That might be a meaningful simplification in code structure, I don't know
>> enough to say.
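[Editorial note: Java has no `Coroutine.yield`, but the inversion Ethan describes can be approximated today with a virtual thread standing in for the coroutine: the push parser's callback "yields" by blocking on a queue until the consumer pulls the next token. Everything below is an invented toy sketch, not a real parser.]

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

public class PushToPull {
    sealed interface Token { }
    record BeginObject() implements Token { }
    record EndObject() implements Token { }
    record EndOfInput() implements Token { }  // sentinel, never handed to users

    // A deliberately silly "push parser": walks the input, fires callbacks.
    static void pushParse(String json, Consumer<Token> listener) {
        for (char c : json.toCharArray()) {
            switch (c) {
                case '{' -> listener.accept(new BeginObject());
                case '}' -> listener.accept(new EndObject());
                default -> { /* everything else elided in this toy */ }
            }
        }
        listener.accept(new EndOfInput());
    }

    // The inversion: run the push parser on its own virtual thread and let the
    // caller pull tokens one at a time. (Toy: if the caller abandons the
    // iterator, the producer thread stays parked on put().)
    static Iterator<Token> pull(String json) {
        BlockingQueue<Token> queue = new ArrayBlockingQueue<>(1);
        Thread.ofVirtual().start(() -> pushParse(json, token -> {
            try {
                queue.put(token);  // blocks here: this is the "yield"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        return new Iterator<>() {
            Token lookahead = take();

            Token take() {
                try {
                    return queue.take();
                } catch (InterruptedException e) {
                    throw new IllegalStateException(e);
                }
            }

            public boolean hasNext() { return !(lookahead instanceof EndOfInput); }

            public Token next() {
                Token current = lookahead;
                lookahead = take();
                return current;
            }
        };
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        pull("{{}}").forEachRemaining(tokens::add);
        System.out.println(tokens.size()); // 4
    }
}
```

Whether parking a thread per document is acceptable is exactly the kind of performance question a JDK API would have to answer; this only shows the control-flow shape.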
>> But, I think there are some hard questions like
>> - Is the intent[5] to be the backing parser for ecosystem databind APIs?
>> - Is the intent that users who want to handle big/deep documents fall back
>> to this?
>> - Are those new language features / conveniences enough to offset the cost
>> of committing to a new API?
>> - To whom exactly does a low level API provide value?
>> - What benefit is standardization in the JDK?
>> and just generally - who would be the consumer(s) of this?
>> The other kind of API still on the table is a Tree. There are two ways to
>> handle this:
>> 1. Load it into `Object`. Use a bunch of instanceof checks/casts to confirm
>> what it actually is.
>> Object v;
>> User u = new User();
>> if ((v = jso.get("id")) != null) {
>>     u.setId((String) v);
>> }
>> if ((v = jso.get("index")) != null) {
>>     u.setIndex(((Long) v).intValue());
>> }
>> if ((v = jso.get("guid")) != null) {
>>     u.setGuid((String) v);
>> }
>> if ((v = jso.get("isActive")) != null) {
>>     u.setIsActive((Boolean) v);
>> }
>> if ((v = jso.get("balance")) != null) {
>>     u.setBalance((String) v);
>> }
>> // ...
>> if ((v = jso.get("latitude")) != null) {
>>     u.setLatitude(v instanceof BigDecimal ? ((BigDecimal) v).doubleValue() : (Double) v);
>> }
>> if ((v = jso.get("longitude")) != null) {
>>     u.setLongitude(v instanceof BigDecimal ? ((BigDecimal) v).doubleValue() : (Double) v);
>> }
>> if ((v = jso.get("greeting")) != null) {
>>     u.setGreeting((String) v);
>> }
>> if ((v = jso.get("favoriteFruit")) != null) {
>>     u.setFavoriteFruit((String) v);
>> }
>> if ((v = jso.get("tags")) != null) {
>>     List<Object> jsonarr = (List<Object>) v;
>>     u.setTags(new ArrayList<>());
>>     for (Object vi : jsonarr) {
>>         u.getTags().add((String) vi);
>>     }
>> }
>> if ((v = jso.get("friends")) != null) {
>>     List<Object> jsonarr = (List<Object>) v;
>>     u.setFriends(new ArrayList<>());
>>     for (Object vi : jsonarr) {
>>         Map<String, Object> jso0 = (Map<String, Object>) vi;
>>         Friend f = new Friend();
>>         f.setId((String) jso0.get("id"));
>>         f.setName((String) jso0.get("name"));
>>         u.getFriends().add(f);
>>     }
>> }
>> 2. Have an explicit model for Json, and helper methods that do said casts[6]
>> this.setSiteSetting(readFromJson(jsonObject.getJsonObject("site")));
>> JsonArray groups = jsonObject.getJsonArray("group");
>> if (groups != null) {
>>     int len = groups.size();
>>     for (int i = 0; i < len; i++) {
>>         JsonObject grp = groups.getJsonObject(i);
>>         SNMPSetting grpSetting = readFromJson(grp);
>>         String grpName = grp.getString("dbgroup", null);
>>         if (grpName != null && grpSetting != null)
>>             this.groupSettings.put(grpName, grpSetting);
>>     }
>> }
>> JsonArray hosts = jsonObject.getJsonArray("host");
>> if (hosts != null) {
>>     int len = hosts.size();
>>     for (int i = 0; i < len; i++) {
>>         JsonObject host = hosts.getJsonObject(i);
>>         SNMPSetting hostSetting = readFromJson(host);
>>         String hostName = host.getString("dbhost", null);
>>         if (hostName != null && hostSetting != null)
>>             this.hostSettings.put(hostName, hostSetting);
>>     }
>> }
>> I think what has become easier to represent in the language nowadays is that
>> explicit model for Json.
>> It's the 101 lesson of sealed interfaces.[7] It feels nice and clean.
>> sealed interface Json {
>>     final class Null implements Json {}
>>     final class True implements Json {}
>>     final class False implements Json {}
>>     final class Array implements Json {}
>>     final class Object implements Json {}
>>     final class String implements Json {}
>>     final class Number implements Json {}
>> }
>> And the cast-and-check approach is now more viable on account of pattern
>> matching.
>> if (jso.get("id") instanceof String v) {
>>     u.setId(v);
>> }
>> if (jso.get("index") instanceof Long v) {
>>     u.setIndex(v.intValue());
>> }
>> if (jso.get("guid") instanceof String v) {
>>     u.setGuid(v);
>> }
>> // or
>> if (jso.get("id") instanceof String id &&
>>         jso.get("index") instanceof Long index &&
>>         jso.get("guid") instanceof String guid) {
>>     return new User(id, index, guid, ...); // look ma, no setters!
>> }
>> And on the horizon, again, is value types.
>> But there are problems with this approach beyond the performance
>> implications of loading into a tree.
>> For one, all the code samples above have different behaviors around null
>> keys and missing keys that are not obvious at first glance.
>> This won't accept any null or missing fields:
>> if (jso.get("id") instanceof String id &&
>>         jso.get("index") instanceof Long index &&
>>         jso.get("guid") instanceof String guid) {
>>     return new User(id, index, guid, ...);
>> }
>> This will accept individual null or missing fields, but also will silently
>> ignore fields with incorrect types:
>> if (jso.get("id") instanceof String v) {
>>     u.setId(v);
>> }
>> if (jso.get("index") instanceof Long v) {
>>     u.setIndex(v.intValue());
>> }
>> if (jso.get("guid") instanceof String v) {
>>     u.setGuid(v);
>> }
>> And, compared to databind where there is information about the expected
>> structure of the document and it's the job of the framework to assert that,
>> I posit that the errors that would be encountered when writing code against
>> this would be more like
>> "something wrong with user"
>> than
>> "problem at users[5].name, expected string or null. got 5"
>> Which feels unideal.
>> One approach I find promising is something close to what Elm does with its
>> decoders[8]. Not just combining assertion and binding like what pattern
>> matching with records allows, but including a scheme for bubbling/nesting
>> errors.
>> static String string(Json json) throws JsonDecodingException {
>>     if (!(json instanceof Json.String jsonString)) {
>>         throw JsonDecodingException.of(
>>             "expected a string",
>>             json
>>         );
>>     } else {
>>         return jsonString.value();
>>     }
>> }
>> static <T> T field(Json json, String fieldName, Decoder<? extends T> valueDecoder)
>>         throws JsonDecodingException {
>>     var jsonObject = object(json);
>>     var value = jsonObject.get(fieldName);
>>     if (value == null) {
>>         throw JsonDecodingException.atField(
>>             fieldName,
>>             JsonDecodingException.of(
>>                 "no value for field",
>>                 json
>>             )
>>         );
>>     } else {
>>         try {
>>             return valueDecoder.decode(value);
>>         } catch (JsonDecodingException e) {
>>             throw JsonDecodingException.atField(
>>                 fieldName,
>>                 e
>>             );
>>         } catch (Exception e) {
>>             throw JsonDecodingException.atField(fieldName, JsonDecodingException.of(e, value));
>>         }
>>     }
>> }
>> Which I think has some benefits over the ways I've seen of working with
>> trees.
>> - It is declarative enough that folks who prefer databind might be happy
>> enough.
>> static User fromJson(Json json) {
>>     return new User(
>>         Decoder.field(json, "id", Decoder::string),
>>         Decoder.field(json, "index", Decoder::long_),
>>         Decoder.field(json, "guid", Decoder::string)
>>     );
>> }
>> // ...
>> List<User> users = Decoders.array(json, User::fromJson);
>> - Handling null and optional fields could be less easily conflated
>> Decoder.field(json, "id", Decoder::string);
>> Decoder.nullableField(json, "id", Decoder::string);
>> Decoder.optionalField(json, "id", Decoder::string);
>> Decoder.optionalNullableField(json, "id", Decoder::string);
>> - It composes well with user defined classes
>> record Guid(String value) {
>>     Guid {
>>         // some assertions on the structure of value
>>     }
>> }
>> Decoder.field(json, "guid", guid -> new Guid(Decoder.string(guid)));
>> // or even
>> record Guid(String value) {
>>     Guid {
>>         // some assertions on the structure of value
>>     }
>>     static Guid fromJson(Json json) {
>>         return new Guid(Decoder.string(json));
>>     }
>> }
>> Decoder.field(json, "guid", Guid::fromJson);
>> - When something goes wrong, the API can handle the fiddliness of capturing
>> information for feedback.
>> In the code I've sketched out it's just what field/index things went wrong
>> at.
>> Potentially capturing metadata like row/col numbers of the source would be
>> sensible too. It's just not reasonable to expect devs to do extra work to
>> get that, and it's really nice to give it.
>> There are also some downsides like
>> - I do not know how compatible it would be with lazy trees.
>> Lazy trees being the only way that a tree API could handle big or deep
>> documents. The general concept as applied in libraries like json-tree[9] is
>> to navigate without doing any work, and that clashes with wanting to
>> instanceof check the info at the current path.
>> - It *almost* gives enough information to be a general schema approach
>> If one field fails, the model above throws an exception immediately. If an
>> API should return "errors": [...], that is inconvenient to construct.
>> - None of the existing popular libraries are doing this
>> The only mechanics that are strictly required to give this sort of API are
>> lambdas. Those have been out for a decade. Yes, sealed interfaces make the
>> data model prettier, but in concept you can build the same thing on top of
>> anything.
>> I could argue that this is because of "cultural momentum" of databind or
>> some other reason, but the fact remains that it isn't a proven out approach.
>> Writing Json libraries is a todo list[10]. There are a lot of bad ideas and
>> this might be one of them.
>> - Performance impact of so many instanceof checks
>> I've gotten a 4.2% slowdown compared to the "regular" tree code without the
>> repeated casts. But that was with a parser that is 5x slower than Jackson's
>> (using the same benchmark project as for the snippets).
>> I think there could be reason to believe that the JIT does well enough with
>> repeated instanceof checks to consider it.
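[Editorial note: the Decoder type the snippets above call into is never defined in the thread. Purely as an illustration of the combinator idea, here is a minimal self-contained version; every name is hypothetical, and the Json model is cut down to strings, numbers, and objects.]

```java
import java.util.Map;

public class DecoderSketch {
    // Toy Json model, just enough for the demo.
    sealed interface Json { }
    record JStr(String value) implements Json { }
    record JNum(long value) implements Json { }
    record JObj(Map<String, Json> pairs) implements Json { }

    static class JsonDecodingException extends Exception {
        JsonDecodingException(String msg) { super(msg); }
    }

    // A decoder is just "Json in, T out, or a descriptive failure".
    @FunctionalInterface
    interface Decoder<T> {
        T decode(Json json) throws JsonDecodingException;
    }

    static String string(Json json) throws JsonDecodingException {
        if (json instanceof JStr s) return s.value();
        throw new JsonDecodingException("expected a string, got " + json);
    }

    static long long_(Json json) throws JsonDecodingException {
        if (json instanceof JNum n) return n.value();
        throw new JsonDecodingException("expected a number, got " + json);
    }

    // Combines "does the field exist?" and "is it the right type?" into one
    // call, wrapping failures with the field name for a path-like message.
    static <T> T field(Json json, String name, Decoder<? extends T> valueDecoder)
            throws JsonDecodingException {
        if (!(json instanceof JObj obj))
            throw new JsonDecodingException("expected an object, got " + json);
        Json value = obj.pairs().get(name);
        if (value == null)
            throw new JsonDecodingException("no value for field '" + name + "'");
        try {
            return valueDecoder.decode(value);
        } catch (JsonDecodingException e) {
            throw new JsonDecodingException("at field '" + name + "': " + e.getMessage());
        }
    }

    public static void main(String[] args) throws JsonDecodingException {
        Json doc = new JObj(Map.of("id", new JStr("abc"), "index", new JNum(7)));
        System.out.println(field(doc, "id", DecoderSketch::string));   // abc
        System.out.println(field(doc, "index", DecoderSketch::long_)); // 7
    }
}
```

A production version would accumulate the field path rather than concatenate strings, but this shows how the error-wrapping composes as decoders nest.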
>> My current thinking is that - despite not solving for large or deep
>> documents - starting with a really "dumb" realized tree API might be the
>> right place to start for the read side of a potential incubator module.
>> But regardless - this feels like a good time to start more concrete
>> conversations. I feel I should cap this email since I've reached the point
>> of decoherence and haven't even mentioned the write side of things.
>> [1]: http://www.cowtowncoder.com/blog/archives/2009/01/entry_131.html
>> [2]: https://security.snyk.io/vuln/maven?search=jackson-databind
>> [3]: I only know like 8 people
>> [4]: https://github.com/fabienrenaud/java-json-benchmark/blob/master/src/main/java/com/github/fabienrenaud/jjb/stream/UsersStreamDeserializer.java
>> [5]: When I say "intent", I do so knowing full well no one has been actively
>> thinking of this for an entire Game of Thrones
>> [6]: https://github.com/yahoo/mysql_perf_analyzer/blob/master/myperf/src/main/java/com/yahoo/dba/perf/myperf/common/SNMPSettings.java
>> [7]: https://www.infoq.com/articles/data-oriented-programming-java/
>> [8]: https://package.elm-lang.org/packages/elm/json/latest/Json-Decode
>> [9]: https://github.com/jbee/json-tree
>> [10]: https://stackoverflow.com/a/14442630/2948173
>> [11]: In 30 days JEP-198 will be recognizably PI days old for the 2nd time
>> in its history.
>> [12]: To me, the fact that it is still an open JEP is more a social
>> convenience than anything. I could just as easily be writing this exact same
>> email about TOML.