> mmonc...@gmail.com wrote:
> 
> yeah. I would expect for json or jsonb, two values, "a, b",  "a" is distinct 
> from "b" should give the same answer as "a::text is distinct from b::text".
> 
>> t...@sss.pgh.pa.us wrote:
>> 
>>> b...@yugabyte.com wrote:
>>> 
>>> I'm going to try to think like this: The number of possible spellings of 
>>> the names of keys in a JSON object is some flavor of infinite. So including 
>>> this in an object:
>>> 
>>> "k": null
>>> 
>>> really is saying something. It says that I do know about "k" and yet that I 
>>> simply have no information available about its value.
>> 
>> I'd read it as asserting that key "k" is meaningful for this object, but the 
>> correct value for that key is not known.
>> 
>> I have a hard time with your assertion that {"x": 42, "y": null} should be 
>> considered equivalent to {"x": 42}, because it would
>> render key-exists predicates useless. Either you have to say that key "y" is 
>> claimed to exist in both of these objects and indeed every object, or you 
>> have to make it fail if the key's value is null (so that it'd say "false" in 
>> both of these cases). Either of those options seems both weird and useless.
>> 
>>> The quirkiness that my first example showed can be seen differently from 
>>> how I saw it first. When I now consider this expression:
>>> 
>>> ('{"x": 42}'::jsonb)->>'y'
>>> 
>>> it seems that its evaluation should simply raise an exception. But you said:
>>> 
>>>> This produces "key y not present in JSON" but someone decided that was too 
>>>> unfriendly and so we instead produce SQL NULL.
>> 
>> Right. This is hard to justify from a purist semantic point of view, but 
>> having the operator throw an error in such cases would make it close to 
>> unusable on not-uniformly-structured data. And really the point of using 
>> JSON inside a SQL database is to cope with irregularly-structured data, so 
>> fuzziness seems like what we want.

Thank you very much for this, Tom. And thanks again to you, David, for your 
earlier input. I hope that it's clear that the purpose of my questions is to 
discover what I'm missing—both w.r.t. actual semantics and w.r.t. the use cases 
that motivate PostgreSQL's functionality. Sorry if, contrary to my intention, 
my questions (here and on other topics) might seem to challenge established 
wisdom and precedent.

I delayed my reply until I'd had time to think, to do some study, and (most 
importantly) to implement a complete, self-contained proof-of-concept to 
substantiate my conclusion. I'm ready, now, to report back.

Summary: I can meet my goal by using PG's native functionality appropriately.

So, w.r.t. this email’s subject, this:

  select strip_null_keys('{"x": 42, "y": null}'::jsonb) = '{"x": 42}'::jsonb;

returns "true".

The point at issue is whether the presence of « "some key": null » is different 
from the absence of "some key". You (all) have argued that these two locutions, 
which I had wanted to see as one notion, are indeed different. Well, yes, of 
course they are. This is tautologically true if you think only of the Unicode 
text of a JSON document that is, so far, waiting to be ingested by who knows 
what system. And you pointed out that, in PostgreSQL,

  my_doc ? 'some key'

detects the difference. You've convinced me that some use cases will care about 
this; in other words, it's a distinction *with* a difference. In my use case, 
however, the two distinct locutions bring no semantic difference, while the 
actual distinction caused me a problem when I tried to prove that this pair of 
transformations is idempotent:

  JSON → relational → JSON

That problem is behind me now.

So... if you've read this far, you can stop now unless you're interested in the 
details.

Before getting to my use case, I'll note that Tom said this about the fact that 
using ->> to read a non-existent key returns a SQL null:

> [This] is hard to justify from a purist semantic point of view, but having 
> the operator throw an error in such cases would make it close to unusable on 
> not-uniformly-structured data.


It's easy to fix this, when the application requires strictness, by 
implementing a user-defined operator pair, say +>> and its partner. I did this 
(to respect what my use case needs), and it passed all my tests. However, for 
reasons that I explain below, it turned out that I didn't need it. The 
operator's implementation function reads the value with the appropriate native 
operator and does more testing only if that returns SQL null (or JSON null). 
First it checks, with the ? operator, whether the key is absent, raising an 
error if so. Then it reads the value again with the native -> operator and 
raises an error if it gets a JSON null.
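
That logic can be sketched like this (a sketch only; the function name 
"strict_object_field" and the +>> spelling are illustrations of the approach, 
not my actual code):

  create function strict_object_field(j in jsonb, key in text)
    returns text
    language plpgsql
  as $body$
  declare
    t text := j ->> key;  -- the native, lenient read
  begin
    if t is null then
      if not (j ? key) then
        raise exception 'key % not present in JSON', key;
      elsif (j -> key) = 'null'::jsonb then
        raise exception 'key % has a JSON null value', key;
      end if;
    end if;
    return t;
  end;
  $body$;

  create operator +>> (
    leftarg  = jsonb,
    rightarg = text,
    function = strict_object_field);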

————————————————————

My use case

I should stress that what I have, so far, is simply a demo implementation on my 
laptop. All the tests that I've managed to invent (including the idempotency 
test) work as I expect them to. However, any aspect of my code can be changed 
in a heartbeat if I realize that it's suspect.

Here's an example of the documents that I have in my table's "jsonb" column.

{
  "isbn"    : "978-0-14-303809-2",
  "title"   : "Joy Luck Club",
  "year"    : 2006, 
  "authors" : [
                {"given name": "Amy", "family name" : "Tan"}
              ],
  "genre"   : "Novel"
}

The documents are very definitely supposed to adhere to a JSON Schema.

  https://json-schema.org/

But I don't need the formality of JSON Schema's notation for such a simple 
case. Prose will do fine.

/* ————— START OF SPEC —————————————————————————————— */


The document's top-level object may use only these keys:

"isbn" — string
values must be unique across the entire set of documents (in other words, 
"isbn" is the unique business key); values must have this pattern:

  « ^[0-9]{3}-[0-9]{1}-[0-9]{2}-[0-9]{6}-[0-9]{1}$ »

"title" — string

"year" — number
must be a positive integral value

"authors" — array of objects;
must contain at least one object

"genre" — string

Each object in the "authors" array may use only these keys:

"family name" — string

"given name" — string

String values other than for "isbn" are unconstrained.

Any key other than the seven listed here is illegal. The "genre" and "given 
name" keys are not required. All the other keys are required.

The meaning of *required* is that extracting the value must never yield a SQL 
null (so a required key must be present and must not have a JSON null value).

And the meaning of *not required* is simply "no information is available for 
this key" (with no nuances). The spec author goes further by adding a rule: 
this meaning must be expressed by the absence of such a key.


/* ————— END OF SPEC ———————————————————————————————— */

The rule that « "some key": null » is not allowed brings the benefit that a 
document can be maximally terse. Moreover, it helps document authors by 
removing the need to decide which locution to use; and it helps programmers to 
write code to check that incoming documents adhere to the spec and then to 
extract their meaning (for example to a classic relational representation).

It's easy to see that the information content implies these business rules:

—Each book must have at least one author. Each author may be among the authors 
of one or several books.

—Each book may be of exactly one (known) genre. Each genre may classify one or 
several books.

My code does this:

—Ingests the input JSON documents into a "source(k... primary key, book_info 
jsonb)" table.

—Shreds the book facts into a classic Codd-and-Date relational representation 
(with the obvious tables "books", "genres", "authors", and "books_authors", and 
with the usual PK and FK constraints).

—Transforms the aggregated set of facts for each book back to a set of JSON 
documents for transport to a different system. Critically, these must adhere to 
the same JSON Schema (and rules) that govern the incoming documents.
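
The "obvious" tables can be sketched like this (the names, data types, and 
column choices here are illustrative, not my actual DDL):

  create table genres(
    k      int generated always as identity primary key,
    genre  text not null unique);

  create table books(
    k        int generated always as identity primary key,
    isbn     text not null unique
               check (isbn ~ '^[0-9]{3}-[0-9]{1}-[0-9]{2}-[0-9]{6}-[0-9]{1}$'),
    title    text not null,
    year     int not null check (year > 0),
    genre_k  int references genres(k));  -- nullable: "genre" is not required

  create table authors(
    k            int generated always as identity primary key,
    family_name  text not null,
    given_name   text);  -- nullable: "given name" is not required

  create table books_authors(
    book_k    int not null references books(k),
    author_k  int not null references authors(k),
    primary key (book_k, author_k));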

This is where the idempotency requirement that I mentioned above comes from:

  JSON → relational → JSON

It's been suggested that this is an ignoble and unattainable goal. I 
disagree—on both counts.

I implemented two complementary functions:

—"no_null_keys()" checks that a "jsonb" value has no occurrences of « "some 
key": null »

—"strip_null_keys()" removes « "some key": null » occurrences from a "jsonb" 
value

The code checks with "no_null_keys()" that, as expected, no ingested JSON 
document has an occurrence of « "some key": null ».

And it uses "strip_null_keys()" on the output of "to_jsonb()" — and, as 
appropriate, any other built-in JSON function that produces a "jsonb" value.

It was straightforward to implement these two functions by using REGEXP 
built-in functionality on the canonically formatted "text" value produced by 
the "jsonb::text" typecast.
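
For what it's worth, the same pair can also be sketched on top of the built-in 
"jsonb_strip_nulls()" function (available since PostgreSQL 9.5), which removes 
« "some key": null » pairs at every nesting level while leaving JSON nulls 
inside arrays untouched:

  -- Sketch only; my actual implementation uses REGEXP functionality.
  create function strip_null_keys(j in jsonb)
    returns jsonb
    language sql
  as $body$
    select jsonb_strip_nulls(j);
  $body$;

  create function no_null_keys(j in jsonb)
    returns boolean
    language sql
  as $body$
    select j = jsonb_strip_nulls(j);
  $body$;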

The check on the incoming documents is included in the 
"j_books_book_info_is_conformant(jsonb)" function that is the basis of a 
constraint that's created on the "source" table's "book_info" column.

More code is needed to implement other constraints: for example, the value of 
the "isbn" (string) key must satisfy a particular regular expression, and the 
value of the "year" (number) key must convert to a positive integer. (You've 
heard about that test already.)

These tests, too, (and other tests) are included in the 
"j_books_book_info_is_conformant(jsonb)" function.

Critically, one test uses "jsonb_object_keys()" to scan the top-level object to 
ensure that all the required keys are present, that every key has the specified 
JSON data type, and that no keys that the JSON Schema doesn't mention are 
present. A similar test does the same for the "authors" array. This is why I 
can be sure that the native ->> and -> operators are sufficient for my purpose.
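
The top-level scan can be sketched like this (the function name is 
illustrative, and my real code does more; the key list and type checks follow 
the spec above):

  create function top_level_keys_ok(j in jsonb)
    returns boolean
    language sql
  as $body$
    select coalesce(
      -- no keys outside the allowed set
      not exists (
        select 1 from jsonb_object_keys(j) as k
        where k not in ('isbn', 'title', 'year', 'authors', 'genre'))
      -- required keys present with the specified JSON data types
      and jsonb_typeof(j -> 'isbn')    = 'string'
      and jsonb_typeof(j -> 'title')   = 'string'
      and jsonb_typeof(j -> 'year')    = 'number'
      and jsonb_typeof(j -> 'authors') = 'array'
      -- "genre" is optional, but must be a string when present
      and (j -> 'genre' is null or jsonb_typeof(j -> 'genre') = 'string'),
      false);
  $body$;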

I considered using the "j_books_book_info_is_conformant(jsonb)" function in a 
constraint for a domain based on "jsonb" but decided against that.
