> From: "Brian Goetz" <brian.go...@oracle.com>
> To: "Ethan McCue" <et...@mccue.dev>, "core-libs-dev" <core-libs-...@openjdk.java.net>
> Sent: Tuesday, February 28, 2023 8:48:00 PM
> Subject: Re: JEP-198 - Lets start talking about JSON
> As you can probably imagine, I've been thinking about these topics for quite a
> while, ever since we started working on records and pattern matching. It sounds
> like a lot of your thoughts have followed a similar arc to ours.
> I'll share with you some of our thoughts, but I can't be engaging in a detailed
> back-and-forth right now -- we have too many other things going on, and this
> isn't yet on the front burner. I think there's a right time for this work, and
> we're not quite there yet, but we'll get there soon enough and we'll pick up
> the ball again then.
> To the existential question: yes, there should be a simpler, built-in way to
> parse JSON. And, as you observe, the railroad diagram in the JSON spec is a
> graphical description of an algebraic data type. One of the great simplifying
> effects of having algebraic data types (records + sealed classes) in the
> language is that many data modeling problems collapse down to the point where
> considerably less creativity is required of an API. Here's the JSON API one can
> write after literally only 30 seconds of thought:
>> sealed interface JsonValue {
>>     record JsonString(String s) implements JsonValue { }
>>     record JsonNumber(double d) implements JsonValue { }
>>     record JsonNull() implements JsonValue { }
>>     record JsonBoolean(boolean b) implements JsonValue { }
>>     record JsonArray(List<JsonValue> values) implements JsonValue { }
>>     record JsonObject(Map<String, JsonValue> pairs) implements JsonValue { }
>> }
> It matches the JSON spec almost literally, and you can use pattern matching to
> parse a document. (OK, there's some tiny bit of creativity here in that
> True/False have been collapsed to a single JsonBoolean type, but you get my
> point.)
> But, we're not quite ready to put this API into the JDK, because the language
> isn't *quite* there yet.
> Records give you nice pattern matching, but they come at a cost; they're very
> specific and have rigid ideas about initialization, which ripples into a
> number of constraints on an implementation (i.e., much harder to parse
> lazily.) So we're waiting until we have deconstruction patterns (next up on
> the patterns parade) so that the records above can be interfaces and still
> support pattern matching (and more flexibility in implementation, including
> using value classes when they arrive.) It's not a long hop, though.
> I agree with your assessment of streaming models; for documents too large to
> fit into memory, we'll let someone else provide a specialized solution.
> Streaming and fully-materialized-tree are not the only two options; there are
> plenty of points in the middle.
> As to API idioms, these can be layered. The lazy-tree model outlined above can
> be a foundation for data binding, dynamic mapping to records, jsonpath, etc.
> But once you've made the streaming-vs-materialized choice in favor of
> materialized, it's hard to imagine not having something like the above at the
> base of the tower.
> The question you raise about error handling is one that infuses pattern
> matching in general. Pattern matching allows us to collapse what would be a
> thousand questions -- "does key X exist? is it mapped to a number? is the
> number in the range of byte?" -- each with their own failure-handling path,
> into a single question. That's great for reliable and readable code, but it
> does make errors more opaque, because it is more like the red "check engine"
> light on your dashboard. (Something like JSONPath could generate better error
> messages since you've given it a declarative description of an assumed
> structural invariant.) But, imperative code that has to treat each structural
> assumption as a possible control-flow point is a disaster; we've seen too much
> code like this already.
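[As an editorial illustration, not from the thread: here is roughly what consuming the JsonValue tree sketched above looks like with Java 21 record patterns. The JsonValue shape follows Brian's sketch; the render method and class name are invented for the demo.]

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class JsonPatterns {
    // The ADT from Brian's sketch, nested here for self-containment.
    sealed interface JsonValue {
        record JsonString(String s) implements JsonValue { }
        record JsonNumber(double d) implements JsonValue { }
        record JsonNull() implements JsonValue { }
        record JsonBoolean(boolean b) implements JsonValue { }
        record JsonArray(List<JsonValue> values) implements JsonValue { }
        record JsonObject(Map<String, JsonValue> pairs) implements JsonValue { }
    }

    // An exhaustive switch over the sealed hierarchy: no default branch is
    // needed, and the compiler complains if a new case is added to JsonValue.
    static String render(JsonValue v) {
        return switch (v) {
            case JsonValue.JsonNull() -> "null";
            case JsonValue.JsonBoolean(boolean b) -> Boolean.toString(b);
            case JsonValue.JsonNumber(double d) -> Double.toString(d);
            case JsonValue.JsonString(String s) -> "\"" + s + "\"";
            case JsonValue.JsonArray(List<JsonValue> values) ->
                values.stream().map(JsonPatterns::render)
                      .collect(Collectors.joining(",", "[", "]"));
            case JsonValue.JsonObject(Map<String, JsonValue> pairs) ->
                pairs.entrySet().stream()
                     .map(e -> "\"" + e.getKey() + "\":" + render(e.getValue()))
                     .collect(Collectors.joining(",", "{", "}"));
        };
    }

    public static void main(String[] args) {
        JsonValue doc = new JsonValue.JsonArray(List.of(
            new JsonValue.JsonNumber(1),
            new JsonValue.JsonBoolean(true),
            new JsonValue.JsonNull()));
        System.out.println(render(doc)); // [1.0,true,null]
    }
}
```

This is the "collapse a thousand questions into one" property Brian describes: each case simultaneously tests the shape and binds the components.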
> The ecosystem is big enough that there will be lots of people with strong
> opinions that "X is the only sensible way to do it" (we've already seen
> X=databinding on this thread), but the reality is that there are multiple
> overlapping audiences here, and we have to be clear which audiences we are
> prioritizing. We can have that debate when the time is right.
> So, we'll get there, but we're waiting for one or two more bits of language
> evolution to give us the substrate for the API that feels right.
> Hope this helps,
> -Brian

You can "simulate" deconstructors by using when + instanceof.
Let's say we have an interface with a deconstructor that can deconstruct an
instance of that interface into a tuple (x, y):

  interface Point {
    record $(int x, int y) {}
    $ deconstructor();
  }

If there is an implementation, the deconstructor is just an implementation of
the instance method "deconstructor":

  class PointImpl implements Point {
    private final int x;
    private final int y;

    public PointImpl(int x, int y) {
      this.x = x;
      this.y = y;
    }

    @Override
    public $ deconstructor() {
      return new $(x, y);
    }
  }

Then inside a switch, "case Point(int x, int y)" can be translated to
"case Point p when p.deconstructor() instanceof Point.$(int x, int y)", like
this:

  public static void main(String[] args) {
    Point point = new PointImpl(3, 4);
    var value = switch (point) {
      case Point p when p.deconstructor() instanceof Point.$(int x, int y) -> x + y;
      default -> throw new MatchException("oops", null);
    };
    System.out.println(value);
  }

Rémi

> On 12/15/2022 3:30 PM, Ethan McCue wrote:
>> I'm writing this to drive some forward motion and to nerd-snipe those who
>> know better than I do into putting their thoughts into words.
>> There are three ways to process JSON[1]:
>> - Streaming (Push or Pull)
>> - Traversing a Tree (Realized or Lazy)
>> - Declarative Databind (N ways)
>> Of these, JEP-198 explicitly ruled out providing "JAXB style type safe data
>> binding."
>> No justification is given, but if I had to insert my own: mapping the Json
>> model to/from the Java/JVM object model is a cursed combo of
>> - Huge possible design space
>> - Unpalatably large surface for backwards compatibility
>> - Serialization! Boo![2]
>> So for an artifact like the JDK, it probably doesn't make sense to include.
>> That tracks.
>> It won't make everyone happy, people like databind APIs, but it tracks.
>> So for the "read flow" these are the things to figure out.
>>                 | Should Provide? | Intended User(s) |
>> ----------------+-----------------+------------------+
>> Streaming Push  |                 |                  |
>> ----------------+-----------------+------------------+
>> Streaming Pull  |                 |                  |
>> ----------------+-----------------+------------------+
>> Realized Tree   |                 |                  |
>> ----------------+-----------------+------------------+
>> Lazy Tree       |                 |                  |
>> ----------------+-----------------+------------------+
>> At which point, we should talk about what "meets needs of Java developers
>> using JSON" implies.
>> JSON is ubiquitous. Most kinds of software us schmucks write could have a
>> reason to interact with it.
>> The full set of "user personas" therefore isn't practical for me to talk
>> about.[3]
>> JSON documents, however, are not so varied.
>> - There are small ones (1-10kb)
>> - There are medium ones (10-1000kb)
>> - There are big ones (1000kb-???)
>> - There are shallow ones
>> - There are deep ones
>> So that feels like an easier direction to talk about it from.
>> This repo[4] has some convenient toy examples of how some of those APIs look
>> in libraries in the ecosystem. Specifically the Streaming Pull and Realized
>> Tree models.
>> User r = new User();
>> while (true) {
>>     JsonToken token = reader.peek();
>>     switch (token) {
>>         case BEGIN_OBJECT:
>>             reader.beginObject();
>>             break;
>>         case END_OBJECT:
>>             reader.endObject();
>>             return r;
>>         case NAME:
>>             String fieldname = reader.nextName();
>>             switch (fieldname) {
>>                 case "id":
>>                     r.setId(reader.nextString());
>>                     break;
>>                 case "index":
>>                     r.setIndex(reader.nextInt());
>>                     break;
>>                 ...
>>                 case "friends":
>>                     r.setFriends(new ArrayList<>());
>>                     Friend f = null;
>>                     carryOn = true;
>>                     while (carryOn) {
>>                         token = reader.peek();
>>                         switch (token) {
>>                             case BEGIN_ARRAY:
>>                                 reader.beginArray();
>>                                 break;
>>                             case END_ARRAY:
>>                                 reader.endArray();
>>                                 carryOn = false;
>>                                 break;
>>                             case BEGIN_OBJECT:
>>                                 reader.beginObject();
>>                                 f = new Friend();
>>                                 break;
>>                             case END_OBJECT:
>>                                 reader.endObject();
>>                                 r.getFriends().add(f);
>>                                 break;
>>                             case NAME:
>>                                 String fn = reader.nextName();
>>                                 switch (fn) {
>>                                     case "id":
>>                                         f.setId(reader.nextString());
>>                                         break;
>>                                     case "name":
>>                                         f.setName(reader.nextString());
>>                                         break;
>>                                 }
>>                                 break;
>>                         }
>>                     }
>>                     break;
>>             }
>>     }
>> }
>> I think it's not hard to argue that the streaming APIs are brutalist. The
>> above is Gson, but Jackson, moshi, etc. seem at least morally equivalent.
>> It's hard to write, hard to write *correctly*, and there is a curious
>> propensity towards pairing it with anemic, mutable models.
>> That being said, it handles big documents and deep documents really well. It
>> also performs pretty darn well and is good enough as a "fallback" when the
>> intended user experience is through something like databind.
>> So what could we do meaningfully better with the language we have today/will
>> have tomorrow?
>> - Sealed interfaces + Pattern matching could give a nicer model for tokens
>> sealed interface JsonToken {
>>     record Field(String name) implements JsonToken {}
>>     record BeginArray() implements JsonToken {}
>>     record EndArray() implements JsonToken {}
>>     record BeginObject() implements JsonToken {}
>>     record EndObject() implements JsonToken {}
>>     // ...
>> }
>> // ...
>> User r = new User();
>> while (true) {
>>     JsonToken token = reader.peek();
>>     switch (token) {
>>         case BeginObject __:
>>             reader.beginObject();
>>             break;
>>         case EndObject __:
>>             reader.endObject();
>>             return r;
>>         case Field("id"):
>>             r.setId(reader.nextString());
>>             break;
>>         case Field("index"):
>>             r.setIndex(reader.nextInt());
>>             break;
>>         // ...
>>         case Field("friends"):
>>             r.setFriends(new ArrayList<>());
>>             Friend f = null;
>>             carryOn = true;
>>             while (carryOn) {
>>                 token = reader.peek();
>>                 switch (token) {
>>                     // ...
>> - Value classes can make it all more efficient
>> sealed interface JsonToken {
>>     value record Field(String name) implements JsonToken {}
>>     value record BeginArray() implements JsonToken {}
>>     value record EndArray() implements JsonToken {}
>>     value record BeginObject() implements JsonToken {}
>>     value record EndObject() implements JsonToken {}
>>     // ...
>> }
>> - (Fun One) We can transform a simpler-to-write push parser into a pull
>> parser with Coroutines
>> This is just a toy we could play with while making something in the JDK. I'm
>> pretty sure we could make a parser which feeds into something like
>> interface Listener {
>>     void onObjectStart();
>>     void onObjectEnd();
>>     void onArrayStart();
>>     void onArrayEnd();
>>     void onField(String name);
>>     // ...
>> }
>> and invert a loop like
>> while (true) {
>>     char c = next();
>>     switch (c) {
>>         case '{':
>>             listener.onObjectStart();
>>             // ...
>>         // ...
>>     }
>> }
>> by putting a Coroutine.yield in the callback.
>> That might be a meaningful simplification in code structure, I don't know
>> enough to say.
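[Editorial note: Java has no `Coroutine.yield`, but the inversion Ethan describes can be approximated today with a virtual thread standing in for the coroutine: the push parser's callback "yields" by blocking on a queue until the consumer pulls the next token. Everything below is an invented toy sketch, not a real parser.]

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

public class PushToPull {
    sealed interface Token { }
    record BeginObject() implements Token { }
    record EndObject() implements Token { }
    record EndOfInput() implements Token { }  // sentinel, never handed to users

    // A deliberately silly "push parser": walks the input, fires callbacks.
    static void pushParse(String json, Consumer<Token> listener) {
        for (char c : json.toCharArray()) {
            switch (c) {
                case '{' -> listener.accept(new BeginObject());
                case '}' -> listener.accept(new EndObject());
                default -> { /* everything else elided in this toy */ }
            }
        }
        listener.accept(new EndOfInput());
    }

    // The inversion: run the push parser on its own virtual thread and let the
    // caller pull tokens one at a time. (Toy: if the caller abandons the
    // iterator, the producer thread stays parked on put().)
    static Iterator<Token> pull(String json) {
        BlockingQueue<Token> queue = new ArrayBlockingQueue<>(1);
        Thread.ofVirtual().start(() -> pushParse(json, token -> {
            try {
                queue.put(token);  // blocks here: this is the "yield"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        return new Iterator<>() {
            Token lookahead = take();

            Token take() {
                try {
                    return queue.take();
                } catch (InterruptedException e) {
                    throw new IllegalStateException(e);
                }
            }

            public boolean hasNext() { return !(lookahead instanceof EndOfInput); }

            public Token next() {
                Token current = lookahead;
                lookahead = take();
                return current;
            }
        };
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        pull("{{}}").forEachRemaining(tokens::add);
        System.out.println(tokens.size()); // 4
    }
}
```

Whether parking a thread per document is acceptable is exactly the kind of performance question a JDK API would have to answer; this only shows the control-flow shape.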
>> But, I think there are some hard questions like
>> - Is the intent[5] to be the backing parser for ecosystem databind APIs?
>> - Is the intent that users who want to handle big/deep documents fall back
>> to this?
>> - Are those new language features / conveniences enough to offset the cost
>> of committing to a new API?
>> - To whom exactly does a low level API provide value?
>> - What benefit is standardization in the JDK?
>> and just generally - who would be the consumer(s) of this?
>> The other kind of API still on the table is a Tree. There are two ways to
>> handle this:
>> 1. Load it into `Object`. Use a bunch of instanceof checks/casts to confirm
>> what it actually is.
>> Object v;
>> User u = new User();
>> if ((v = jso.get("id")) != null) {
>>     u.setId((String) v);
>> }
>> if ((v = jso.get("index")) != null) {
>>     u.setIndex(((Long) v).intValue());
>> }
>> if ((v = jso.get("guid")) != null) {
>>     u.setGuid((String) v);
>> }
>> if ((v = jso.get("isActive")) != null) {
>>     u.setIsActive((Boolean) v);
>> }
>> if ((v = jso.get("balance")) != null) {
>>     u.setBalance((String) v);
>> }
>> // ...
>> if ((v = jso.get("latitude")) != null) {
>>     u.setLatitude(v instanceof BigDecimal ? ((BigDecimal) v).doubleValue() : (Double) v);
>> }
>> if ((v = jso.get("longitude")) != null) {
>>     u.setLongitude(v instanceof BigDecimal ? ((BigDecimal) v).doubleValue() : (Double) v);
>> }
>> if ((v = jso.get("greeting")) != null) {
>>     u.setGreeting((String) v);
>> }
>> if ((v = jso.get("favoriteFruit")) != null) {
>>     u.setFavoriteFruit((String) v);
>> }
>> if ((v = jso.get("tags")) != null) {
>>     List<Object> jsonarr = (List<Object>) v;
>>     u.setTags(new ArrayList<>());
>>     for (Object vi : jsonarr) {
>>         u.getTags().add((String) vi);
>>     }
>> }
>> if ((v = jso.get("friends")) != null) {
>>     List<Object> jsonarr = (List<Object>) v;
>>     u.setFriends(new ArrayList<>());
>>     for (Object vi : jsonarr) {
>>         Map<String, Object> jso0 = (Map<String, Object>) vi;
>>         Friend f = new Friend();
>>         f.setId((String) jso0.get("id"));
>>         f.setName((String) jso0.get("name"));
>>         u.getFriends().add(f);
>>     }
>> }
>> 2. Have an explicit model for Json, and helper methods that do said casts[6]
>> this.setSiteSetting(readFromJson(jsonObject.getJsonObject("site")));
>> JsonArray groups = jsonObject.getJsonArray("group");
>> if (groups != null) {
>>     int len = groups.size();
>>     for (int i = 0; i < len; i++) {
>>         JsonObject grp = groups.getJsonObject(i);
>>         SNMPSetting grpSetting = readFromJson(grp);
>>         String grpName = grp.getString("dbgroup", null);
>>         if (grpName != null && grpSetting != null)
>>             this.groupSettings.put(grpName, grpSetting);
>>     }
>> }
>> JsonArray hosts = jsonObject.getJsonArray("host");
>> if (hosts != null) {
>>     int len = hosts.size();
>>     for (int i = 0; i < len; i++) {
>>         JsonObject host = hosts.getJsonObject(i);
>>         SNMPSetting hostSetting = readFromJson(host);
>>         String hostName = host.getString("dbhost", null);
>>         if (hostName != null && hostSetting != null)
>>             this.hostSettings.put(hostName, hostSetting);
>>     }
>> }
>> I think what has become easier to represent in the language nowadays is that
>> explicit model for Json.
>> It's the 101 lesson of sealed interfaces.[7] It feels nice and clean.
>> sealed interface Json {
>>     final class Null implements Json {}
>>     final class True implements Json {}
>>     final class False implements Json {}
>>     final class Array implements Json {}
>>     final class Object implements Json {}
>>     final class String implements Json {}
>>     final class Number implements Json {}
>> }
>> And the cast-and-check approach is now more viable on account of pattern
>> matching.
>> if (jso.get("id") instanceof String v) {
>>     u.setId(v);
>> }
>> if (jso.get("index") instanceof Long v) {
>>     u.setIndex(v.intValue());
>> }
>> if (jso.get("guid") instanceof String v) {
>>     u.setGuid(v);
>> }
>> // or
>> if (jso.get("id") instanceof String id &&
>>         jso.get("index") instanceof Long index &&
>>         jso.get("guid") instanceof String guid) {
>>     return new User(id, index, guid, ...); // look ma, no setters!
>> }
>> And on the horizon, again, is value types.
>> But there are problems with this approach beyond the performance
>> implications of loading into a tree.
>> For one, all the code samples above have different behaviors around null
>> keys and missing keys that are not obvious at first glance.
>> This won't accept any null or missing fields:
>> if (jso.get("id") instanceof String id &&
>>         jso.get("index") instanceof Long index &&
>>         jso.get("guid") instanceof String guid) {
>>     return new User(id, index, guid, ...);
>> }
>> This will accept individual null or missing fields, but also will silently
>> ignore fields with incorrect types:
>> if (jso.get("id") instanceof String v) {
>>     u.setId(v);
>> }
>> if (jso.get("index") instanceof Long v) {
>>     u.setIndex(v.intValue());
>> }
>> if (jso.get("guid") instanceof String v) {
>>     u.setGuid(v);
>> }
>> And, compared to databind where there is information about the expected
>> structure of the document and it's the job of the framework to assert that,
>> I posit that the errors that would be encountered when writing code against
>> this would be more like
>> "something wrong with user"
>> than
>> "problem at users[5].name, expected string or null. got 5"
>> Which feels unideal.
>> One approach I find promising is something close to what Elm does with its
>> decoders[8]. Not just combining assertion and binding like what pattern
>> matching with records allows, but including a scheme for bubbling/nesting
>> errors.
>> static String string(Json json) throws JsonDecodingException {
>>     if (!(json instanceof Json.String jsonString)) {
>>         throw JsonDecodingException.of(
>>             "expected a string",
>>             json
>>         );
>>     } else {
>>         return jsonString.value();
>>     }
>> }
>> static <T> T field(Json json, String fieldName, Decoder<? extends T> valueDecoder)
>>         throws JsonDecodingException {
>>     var jsonObject = object(json);
>>     var value = jsonObject.get(fieldName);
>>     if (value == null) {
>>         throw JsonDecodingException.atField(
>>             fieldName,
>>             JsonDecodingException.of(
>>                 "no value for field",
>>                 json
>>             )
>>         );
>>     } else {
>>         try {
>>             return valueDecoder.decode(value);
>>         } catch (JsonDecodingException e) {
>>             throw JsonDecodingException.atField(
>>                 fieldName,
>>                 e
>>             );
>>         } catch (Exception e) {
>>             throw JsonDecodingException.atField(fieldName, JsonDecodingException.of(e, value));
>>         }
>>     }
>> }
>> Which I think has some benefits over the ways I've seen of working with
>> trees.
>> - It is declarative enough that folks who prefer databind might be happy
>> enough.
>> static User fromJson(Json json) {
>>     return new User(
>>         Decoder.field(json, "id", Decoder::string),
>>         Decoder.field(json, "index", Decoder::long_),
>>         Decoder.field(json, "guid", Decoder::string)
>>     );
>> }
>> // ...
>> List<User> users = Decoders.array(json, User::fromJson);
>> - Handling null and optional fields could be less easily conflated
>> Decoder.field(json, "id", Decoder::string);
>> Decoder.nullableField(json, "id", Decoder::string);
>> Decoder.optionalField(json, "id", Decoder::string);
>> Decoder.optionalNullableField(json, "id", Decoder::string);
>> - It composes well with user defined classes
>> record Guid(String value) {
>>     Guid {
>>         // some assertions on the structure of value
>>     }
>> }
>> Decoder.field(json, "guid", guid -> new Guid(Decoder.string(guid)));
>> // or even
>> record Guid(String value) {
>>     Guid {
>>         // some assertions on the structure of value
>>     }
>>     static Guid fromJson(Json json) {
>>         return new Guid(Decoder.string(json));
>>     }
>> }
>> Decoder.field(json, "guid", Guid::fromJson);
>> - When something goes wrong, the API can handle the fiddliness of capturing
>> information for feedback.
>> In the code I've sketched out it's just what field/index things went wrong
>> at.
>> Potentially capturing metadata like row/col numbers of the source would be
>> sensible too. It's just not reasonable to expect devs to do extra work to
>> get that, and it's really nice to give it.
>> There are also some downsides like
>> - I do not know how compatible it would be with lazy trees.
>> Lazy trees being the only way that a tree API could handle big or deep
>> documents. The general concept as applied in libraries like json-tree[9] is
>> to navigate without doing any work, and that clashes with wanting to
>> instanceof check the info at the current path.
>> - It *almost* gives enough information to be a general schema approach
>> If one field fails, the model above throws an exception immediately. If an
>> API should return "errors": [...], that is inconvenient to construct.
>> - None of the existing popular libraries are doing this
>> The only mechanics that are strictly required to give this sort of API are
>> lambdas. Those have been out for a decade. Yes, sealed interfaces make the
>> data model prettier, but in concept you can build the same thing on top of
>> anything.
>> I could argue that this is because of "cultural momentum" of databind or
>> some other reason, but the fact remains that it isn't a proven out approach.
>> Writing Json libraries is a todo list[10]. There are a lot of bad ideas and
>> this might be one of them.
>> - Performance impact of so many instanceof checks
>> I've gotten a 4.2% slowdown compared to the "regular" tree code without the
>> repeated casts. But that was with a parser that is 5x slower than Jackson's
>> (using the same benchmark project as for the snippets).
>> I think there could be reason to believe that the JIT does well enough with
>> repeated instanceof checks to consider it.
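[Editorial note: the Decoder type the snippets above call into is never defined in the thread. Purely as an illustration of the combinator idea, here is a minimal self-contained version; every name is hypothetical, and the Json model is cut down to strings, numbers, and objects.]

```java
import java.util.Map;

public class DecoderSketch {
    // Toy Json model, just enough for the demo.
    sealed interface Json { }
    record JStr(String value) implements Json { }
    record JNum(long value) implements Json { }
    record JObj(Map<String, Json> pairs) implements Json { }

    static class JsonDecodingException extends Exception {
        JsonDecodingException(String msg) { super(msg); }
    }

    // A decoder is just "Json in, T out, or a descriptive failure".
    @FunctionalInterface
    interface Decoder<T> {
        T decode(Json json) throws JsonDecodingException;
    }

    static String string(Json json) throws JsonDecodingException {
        if (json instanceof JStr s) return s.value();
        throw new JsonDecodingException("expected a string, got " + json);
    }

    static long long_(Json json) throws JsonDecodingException {
        if (json instanceof JNum n) return n.value();
        throw new JsonDecodingException("expected a number, got " + json);
    }

    // Combines "does the field exist?" and "is it the right type?" into one
    // call, wrapping failures with the field name for a path-like message.
    static <T> T field(Json json, String name, Decoder<? extends T> valueDecoder)
            throws JsonDecodingException {
        if (!(json instanceof JObj obj))
            throw new JsonDecodingException("expected an object, got " + json);
        Json value = obj.pairs().get(name);
        if (value == null)
            throw new JsonDecodingException("no value for field '" + name + "'");
        try {
            return valueDecoder.decode(value);
        } catch (JsonDecodingException e) {
            throw new JsonDecodingException("at field '" + name + "': " + e.getMessage());
        }
    }

    public static void main(String[] args) throws JsonDecodingException {
        Json doc = new JObj(Map.of("id", new JStr("abc"), "index", new JNum(7)));
        System.out.println(field(doc, "id", DecoderSketch::string));   // abc
        System.out.println(field(doc, "index", DecoderSketch::long_)); // 7
    }
}
```

A production version would accumulate the field path rather than concatenate strings, but this shows how the error-wrapping composes as decoders nest.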
>> My current thinking is that - despite not solving for large or deep
>> documents - starting with a really "dumb" realized tree API might be the
>> right place to start for the read side of a potential incubator module.
>> But regardless - this feels like a good time to start more concrete
>> conversations. I feel I should cap this email since I've reached the point
>> of decoherence and haven't even mentioned the write side of things.
>> [1]: http://www.cowtowncoder.com/blog/archives/2009/01/entry_131.html
>> [2]: https://security.snyk.io/vuln/maven?search=jackson-databind
>> [3]: I only know like 8 people
>> [4]: https://github.com/fabienrenaud/java-json-benchmark/blob/master/src/main/java/com/github/fabienrenaud/jjb/stream/UsersStreamDeserializer.java
>> [5]: When I say "intent", I do so knowing full well no one has been actively
>> thinking of this for an entire Game of Thrones
>> [6]: https://github.com/yahoo/mysql_perf_analyzer/blob/master/myperf/src/main/java/com/yahoo/dba/perf/myperf/common/SNMPSettings.java
>> [7]: https://www.infoq.com/articles/data-oriented-programming-java/
>> [8]: https://package.elm-lang.org/packages/elm/json/latest/Json-Decode
>> [9]: https://github.com/jbee/json-tree
>> [10]: https://stackoverflow.com/a/14442630/2948173
>> [11]: In 30 days JEP-198 will be recognizably PI days old for the 2nd time
>> in its history.
>> [12]: To me, the fact that it is still an open JEP is more a social
>> convenience than anything. I could just as easily be writing this exact same
>> email about TOML.