> mmonc...@gmail.com wrote:
> 
> yeah. I would expect for json or jsonb, two values, "a, b",  "a" is distinct 
> from "b" should give the same answer as "a::text is distinct from b::text".
> 
>> t...@sss.pgh.pa.us wrote:
>> 
>>> b...@yugabyte.com wrote:
>>> 
>>> I'm going to try to think like this: The number of possible spellings of 
>>> the names of keys in a JSON object is some flavor of infinite. So including 
>>> this in an object:
>>> 
>>> "k": null
>>> 
>>> really is saying something. It says that I do know about "k" and yet that I 
>>> simply have no information available about its value.
>> 
>> I'd read it as asserting that key "k" is meaningful for this object, but the 
>> correct value for that key is not known.
>> 
>> I have a hard time with your assertion that {"x": 42, "y": null} should be 
>> considered equivalent to {"x": 42}, because it would
>> render key-exists predicates useless. Either you have to say that key "y" is 
>> claimed to exist in both of these objects and indeed every object, or you 
>> have to make it fail if the key's value is null (so that it'd say "false" in 
>> both of these cases). Either of those options seems both weird and useless.
>> 
>>> The quirkiness that my first example showed can be seen differently from 
>>> how I saw it first. When I now consider this expression:
>>> 
>>> ('{"x": 42}'::jsonb)->>'y'
>>> 
>>> it seems that its evaluation should simply raise an exception. But you said:
>>> 
>>>> This produces "key y not present in JSON" but someone decided that was too 
>>>> unfriendly and so we instead produce SQL NULL.
>> 
>> Right. This is hard to justify from a purist semantic point of view, but 
>> having the operator throw an error in such cases would make it close to 
>> unusable on not-uniformly-structured data. And really the point of using 
>> JSON inside a SQL database is to cope with irregularly-structured data, so 
>> fuzziness seems like what we want.

Thank you very much for this, Tom. And thanks again to you, David, for your 
earlier input. I hope that it's clear that the purpose of my questions is to 
discover what I'm missing—both w.r.t. actual semantics and w.r.t. the use cases 
that motivate PostgreSQL's functionality. Sorry if, contrary to my intention, 
my questions (here and on other topics) might seem to challenge established 
wisdom and precedent.

I delayed my reply until I'd had time to think, to do some study, and (most 
importantly) to implement a complete, self-contained proof-of-concept to 
substantiate my conclusion. I'm ready, now, to report back.

Summary: I can meet my goal by using PG's native functionality appropriately.

So, w.r.t. this email’s subject, this:

  select strip_null_keys('{"x": 42, "y": null}'::jsonb) = '{"x": 42}'::jsonb;

returns "true".

The point at issue is whether the presence of « "some key": null » is different 
from the absence of "some key". You (all) have argued that these two locutions, 
which I had wanted to see as one notion, are indeed different. Well, yes, of 
course they are. This is tautologically true if you think only of the Unicode 
text of a JSON document that is, so far, waiting to be ingested by who knows 
what system. And you pointed out that, in PostgreSQL,

  my_doc ? 'some key'

detects the difference. You've convinced me that some use cases will care about 
this; in other words, it's a distinction *with* a difference. In my use case, 
however, the two distinct locutions bring no semantic difference, while the 
actual distinction caused me a problem when I tried to prove that this pair of 
transformations is idempotent:

  JSON → relational → JSON

That problem is behind me now.

So... if you've read this far, you can stop now unless you're interested in the 
details.

Before getting to my use case, I'll note that Tom said this about the fact that 
using ->> to read a non-existent key returns a SQL null:

> [This] is hard to justify from a purist semantic point of view, but having 
> the operator throw an error in such cases would make it close to unusable on 
> not-uniformly-structured data.


It's easy to fix this, when the application requires strictness, by 
implementing a user-defined operator pair, say +>> and its partner. I did this 
(to respect what my use case needs), and it passed all my tests. However, for 
reasons that I explain below, it turned out that I didn't need it. The 
operator's implementation function reads the value with the appropriate native 
operator and does more testing only if that returns SQL null (or JSON null). 
First it checks, with the ? operator, whether the key is absent, raising an 
error if so. Then it reads the value again with the native -> operator and 
raises an error if it gets a JSON null.
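
That logic can be sketched like this (a sketch only; the function name 
"strict_object_field" and the +>> spelling are illustrations of the approach, 
not my actual code):

  create function strict_object_field(j in jsonb, key in text)
    returns text
    language plpgsql
  as $body$
  declare
    t text := j ->> key;  -- the native, lenient read
  begin
    if t is null then
      if not (j ? key) then
        raise exception 'key % not present in JSON', key;
      elsif (j -> key) = 'null'::jsonb then
        raise exception 'key % has a JSON null value', key;
      end if;
    end if;
    return t;
  end;
  $body$;

  create operator +>> (
    leftarg  = jsonb,
    rightarg = text,
    function = strict_object_field);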

————————————————————

My use case

I should stress that what I have, so far, is simply a demo implementation on my 
laptop. All the tests that I've managed to invent (including the idempotency 
test) work as I expect them to. However, any aspect of my code can be changed 
in a heartbeat if I realize that it's suspect.

Here's an example of the documents that I have in my table's "jsonb" column.

{
  "isbn"    : "978-0-14-303809-2",
  "title"   : "Joy Luck Club",
  "year"    : 2006, 
  "authors" : [
                {"given name": "Amy", "family name" : "Tan"}
              ],
  "genre"   : "Novel"
}

The documents are very definitely supposed to adhere to a JSON Schema.

  https://json-schema.org/

But I don't need the formality of JSON Schema's notation for such a simple 
case. Prose will do fine.

/* ————— START OF SPEC —————————————————————————————— */


The document's top-level object may use only these keys:

"isbn" — string
values must be unique across the entire set of documents (in other words, 
"isbn" is the unique business key); values must have this pattern:

  « ^[0-9]{3}-[0-9]{1}-[0-9]{2}-[0-9]{6}-[0-9]{1}$ »

"title" — string

"year" — number
must be a positive integral value

"authors" — array of objects;
must contain at least one object

"genre" — string

Each object in the "authors" array may use only these keys:

"family name" — string

"given name" — string

String values other than for "isbn" are unconstrained.

Any key other than the seven listed here is illegal. The "genre" and "given 
name" keys are not required. All the other keys are required.

The meaning of *required* is that extracting the value must never yield a SQL 
null (so a required key must be present and must not have a JSON null value).

And the meaning of *not required* is simply "no information is available for 
this key" (with no nuances). The spec author goes further by adding a rule: 
this meaning must be expressed by the absence of such a key.


/* ————— END OF SPEC ———————————————————————————————— */

The rule that « "some key": null » is not allowed brings the benefit that a 
document can be maximally terse. Moreover, it helps document authors by 
removing the need to decide which locution to use; and it helps programmers to 
write code to check that incoming documents adhere to the spec and then to 
extract their meaning (for example to a classic relational representation).

It's easy to see that the information content implies these business rules:

—Each book must have at least one author. Each author may be among the authors 
of one or several books.

—Each book may be of exactly one (known) genre. Each genre may classify one or 
several books.

My code does this:

—Ingests the input JSON documents into a "source(k... primary key, book_info 
jsonb)" table.

—Shreds the book facts into a classic Codd-and-Date relational representation 
(with the obvious tables "books", "genres", "authors", and "books_authors", and 
with the usual PK and FK constraints).

—Transforms the aggregated set of facts for each book back to a set of JSON 
documents for transport to a different system. Critically, these must adhere to 
the same JSON Schema (and rules) that govern the incoming documents.
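
The "obvious" tables can be sketched like this (the names, data types, and 
column choices here are illustrative, not my actual DDL):

  create table genres(
    k      int generated always as identity primary key,
    genre  text not null unique);

  create table books(
    k        int generated always as identity primary key,
    isbn     text not null unique
               check (isbn ~ '^[0-9]{3}-[0-9]{1}-[0-9]{2}-[0-9]{6}-[0-9]{1}$'),
    title    text not null,
    year     int not null check (year > 0),
    genre_k  int references genres(k));  -- nullable: "genre" is not required

  create table authors(
    k            int generated always as identity primary key,
    family_name  text not null,
    given_name   text);  -- nullable: "given name" is not required

  create table books_authors(
    book_k    int not null references books(k),
    author_k  int not null references authors(k),
    primary key (book_k, author_k));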

This is where the idempotency requirement that I mentioned above comes from:

  JSON → relational → JSON

It's been suggested that this is an ignoble and unattainable goal. I 
disagree—on both counts.

I implemented two complementary functions:

—"no_null_keys()" checks that a "jsonb" value has no occurrences of « "some 
key": null »

—"strip_null_keys()" removes « "some key": null » occurrences from a "jsonb" 
value

The code checks with "no_null_keys()" that, as expected, no ingested JSON 
document has an occurrence of « "some key": null ».

And it uses "strip_null_keys()" on the output of "to_jsonb()" — and, as 
appropriate, any other built-in JSON function that produces a "jsonb" value.

It was straightforward to implement these two functions by using REGEXP 
built-in functionality on the canonically formatted "text" value produced by 
the "jsonb::text" typecast.
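
For what it's worth, the same pair can also be sketched on top of the built-in 
"jsonb_strip_nulls()" function (available since PostgreSQL 9.5), which removes 
« "some key": null » pairs at every nesting level while leaving JSON nulls 
inside arrays untouched:

  -- Sketch only; my actual implementation uses REGEXP functionality.
  create function strip_null_keys(j in jsonb)
    returns jsonb
    language sql
  as $body$
    select jsonb_strip_nulls(j);
  $body$;

  create function no_null_keys(j in jsonb)
    returns boolean
    language sql
  as $body$
    select j = jsonb_strip_nulls(j);
  $body$;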

The check on the incoming documents is included in the 
"j_books_book_info_is_conformant(jsonb)" function that is the basis of a 
constraint that's created on the "source" table's "book_info" column.

More code is needed to implement other constraints: for example, the value of 
the "isbn" (string) key must satisfy a particular regular expression, and the 
value of the "year" (number) key must convert to a positive integer. (You've 
heard about that test already.)

These tests, too, (and other tests) are included in the 
"j_books_book_info_is_conformant(jsonb)" function.

Critically, one test uses "jsonb_object_keys()" to scan the top-level object to 
ensure that all the required keys are present, that every key has the specified 
JSON data type, and that no keys that the JSON Schema doesn't mention are 
present. A similar test does the same for the "authors" array. This is why I 
can be sure that the native ->> and -> operators are sufficient for my purpose.
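
The top-level scan can be sketched like this (the function name is 
illustrative, and my real code does more; the key list and type checks follow 
the spec above):

  create function top_level_keys_ok(j in jsonb)
    returns boolean
    language sql
  as $body$
    select coalesce(
      -- no keys outside the allowed set
      not exists (
        select 1 from jsonb_object_keys(j) as k
        where k not in ('isbn', 'title', 'year', 'authors', 'genre'))
      -- required keys present with the specified JSON data types
      and jsonb_typeof(j -> 'isbn')    = 'string'
      and jsonb_typeof(j -> 'title')   = 'string'
      and jsonb_typeof(j -> 'year')    = 'number'
      and jsonb_typeof(j -> 'authors') = 'array'
      -- "genre" is optional, but must be a string when present
      and (j -> 'genre' is null or jsonb_typeof(j -> 'genre') = 'string'),
      false);
  $body$;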

I considered using the "j_books_book_info_is_conformant(jsonb)" function in a 
constraint for a domain based on "jsonb" but decided against that.
