On 2024-12-16 Mo 10:09 AM, Joel Jacobson wrote:
> Hi hackers,
> After further consideration, I'm withdrawing the patch.
> Some fundamental questions remain unresolved:
> - Should round-trip fidelity be a strict goal? By "round-trip fidelity",
> I mean that data exported and then re-imported should yield exactly
> the original values, including the distinction between NULL and empty
> strings.
> - If round-trip fidelity is a requirement, how do we distinguish NULL from empty
> strings without delimiters or escapes?
> - Is automatic newline detection (as in "csv" and "text") more valuable than
> the ability to embed \r (CR) characters?
> - Would it be better to extend the existing COPY options rather than introducing
> a new format?
> - Or should we consider a JSONL format instead, one that avoids the NULL/empty
> string problem entirely?
> No clear solution or consensus has emerged. For now, I'll step back from the
> proposal. If someone wants to revisit this later, I'd be happy to contribute.
> Thanks again for all the feedback and consideration.
We seem to have got seriously into the weeds here. I'd be sorry to see
this dropped. After all, it's not something new, and while we have a
sort of workaround for "one json doc per line", it's far from obvious
and, except in a few blog posts, undocumented.
I think we're trying to be far too general here, in the absence of
more general use cases. The ones I recall having encountered in the wild
are:
. one json datum per line
. one json document per file
. a sequence of json documents per file
The last one is hard to deal with, and I think I've only seen it once or
twice, so I suggest leaving it aside for now.
Notice these are all JSON. I could imagine XML might have similar
requirements, but I encounter it extremely rarely.
Regarding NULL, an empty string is not a valid JSON literal, so there
should be no confusion there. It is valid for XML, though.
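To illustrate the NULL point, here is a minimal Python sketch. It only shows that a standard JSON parser rejects an empty input while accepting the JSON empty-string literal `""`. The `decode_line` helper and its rule of mapping an empty line to NULL are hypothetical, one possible convention rather than anything proposed in the patch.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if text parses as a complete JSON value."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def decode_line(line: str):
    """Hypothetical import rule: an empty line means SQL NULL (None).

    Any other line must be valid JSON and is returned unchanged.
    """
    if line == "":
        return None
    json.loads(line)  # raises json.JSONDecodeError on invalid input
    return line
```

Under that assumed convention, an empty line, the JSON string `""`, and the JSON literal `null` all stay distinguishable on input.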
Given all that I think restricting ourselves to just the JSON cases, and
possibly just to JSONL, would be perfectly reasonable.
Regarding CR, it's not a valid character in a JSON string item, although
it is valid in JSON whitespace. I would not treat it as magical unless
it immediately precedes an NL. That gives rise to a very slight
ambiguity, but I think it's one we could live with.
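The CR rule above can be sketched in Python as follows. This is only an illustration under my own assumptions, not the patch's code: `split_records` is a hypothetical helper that works on an in-memory byte string, treats NL as the record terminator, and drops a CR only when it sits directly before that NL.

```python
def split_records(data: bytes) -> list[bytes]:
    """Split input into records on NL, treating CR NL as one terminator.

    A CR that is not immediately followed by NL is left in the record.
    """
    pieces = data.split(b"\n")
    # Whatever follows the final NL (empty if the input ends with NL).
    last = pieces.pop()
    # Every remaining piece was terminated by NL, so a trailing CR on it
    # is part of a CR NL pair and is dropped.
    records = [p[:-1] if p.endswith(b"\r") else p for p in pieces]
    if last:
        # Unterminated final record: no NL follows, so keep any CR as-is.
        records.append(last)
    return records
```

The slight ambiguity is visible here: a record whose JSON ends in a whitespace CR just before the NL loses that CR, which does not change the parsed value.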
As for what the format is called, I don't like the "LIST" proposal much,
even for the general case. Seems too close to an array.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com