On 2024-01-03 We 08:45, Robert Haas wrote:
On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <and...@dunslane.net> wrote:
Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.
IMHO, this is not a good direction. Anybody who is parsing JSON
probably wants to discard the duplicated labels and convert other
heavily duplicated strings to enum values or something. (e.g. if every
record has {"color":"red"} or {"color":"green"}). So the hash table
lookups will cost but won't really save anything more than just
freeing the memory not needed, but will probably be more expensive.


I don't quite follow.

Say we have a document with an array of 1m objects, each with a field called "color". As it stands we'll allocate space for that field name 1m times. Using a hash table we'd allocate space for it once. And allocating the memory isn't free, although it might be cheaper than doing hash lookups.

I guess we can benchmark it and see what the performance impact of using a hash table might be.
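
To make the trade-off concrete, here is a rough standalone sketch of the interning idea, not a patch. It assumes plain malloc instead of palloc, and every name in it (InternTable, intern_name, intern_free) is hypothetical; nothing like this exists in the tree today. Each distinct field name is copied once, and later occurrences cost a hash plus a comparison instead of an allocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical interning table; these names are illustrative only. */
#define INTERN_BUCKETS 256		/* must be a power of two */

typedef struct InternEntry
{
	struct InternEntry *next;	/* next entry in the same bucket */
	size_t		len;			/* length of name, without the terminator */
	char	   *name;			/* NUL-terminated copy, owned by the table */
} InternEntry;

typedef struct InternTable
{
	InternEntry *buckets[INTERN_BUCKETS];
	size_t		ndistinct;		/* how many names were actually allocated */
} InternTable;

/* 32-bit FNV-1a hash over the name bytes. */
static uint32_t
intern_hash(const char *s, size_t len)
{
	uint32_t	h = 2166136261u;

	for (size_t i = 0; i < len; i++)
	{
		h ^= (unsigned char) s[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Return a pointer to the table's single copy of the name, allocating it
 * only the first time the name is seen.  Returns NULL on out-of-memory.
 * The pointer stays valid until intern_free() is called.
 */
static const char *
intern_name(InternTable *tab, const char *s, size_t len)
{
	uint32_t	b;
	InternEntry *e;

	assert(tab != NULL && s != NULL);
	b = intern_hash(s, len) & (INTERN_BUCKETS - 1);

	for (e = tab->buckets[b]; e != NULL; e = e->next)
	{
		if (e->len == len && memcmp(e->name, s, len) == 0)
			return e->name;		/* seen before: no allocation */
	}

	e = malloc(sizeof(InternEntry));
	if (e == NULL)
		return NULL;
	e->name = malloc(len + 1);
	if (e->name == NULL)
	{
		free(e);
		return NULL;
	}
	memcpy(e->name, s, len);
	e->name[len] = '\0';
	e->len = len;
	e->next = tab->buckets[b];
	tab->buckets[b] = e;
	tab->ndistinct++;
	return e->name;
}

/* Release every interned name, e.g. at the end of the parse. */
static void
intern_free(InternTable *tab)
{
	for (size_t i = 0; i < INTERN_BUCKETS; i++)
	{
		InternEntry *e = tab->buckets[i];

		while (e != NULL)
		{
			InternEntry *next = e->next;

			free(e->name);
			free(e);
			e = next;
		}
		tab->buckets[i] = NULL;
	}
	tab->ndistinct = 0;
}
```

With a zero-initialized InternTable, calling intern_name(&tab, "color", 5) 1m times performs one name allocation and 1m lookups. Whether that beats 1m small allocations is exactly what a benchmark would need to show.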

Another possibility would be simply to have the callback free the field name after use. For the parse_manifest code that could be a one-line addition to the code at the bottom of json_object_manifest_field_start().
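
A minimal standalone sketch of that pattern, assuming the parser hands the callback a freshly allocated copy of the name that the callback then owns. The types and functions here (DemoField, DemoState, demo_copy_name, demo_field_start) are made up for illustration and are not the parse_manifest code; malloc/free stand in for palloc/pfree.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for a parser's per-field state. */
typedef enum DemoField
{
	DEMO_FIELD_UNKNOWN,
	DEMO_FIELD_COLOR
} DemoField;

typedef struct DemoState
{
	DemoField	current;		/* which field we are currently inside */
	long		names_freed;	/* counts names released by the callback */
} DemoState;

/* What the parser side would do: hand over a private copy of the name. */
static char *
demo_copy_name(const char *s)
{
	size_t		len = strlen(s);
	char	   *copy = malloc(len + 1);

	if (copy != NULL)
		memcpy(copy, s, len + 1);
	return copy;
}

/*
 * Field-start callback: translate the name into an enum value, then free
 * the string, since nothing needs it after this point.
 */
static void
demo_field_start(DemoState *st, char *fname)
{
	if (fname == NULL)
	{
		st->current = DEMO_FIELD_UNKNOWN;
		return;
	}

	if (strcmp(fname, "color") == 0)
		st->current = DEMO_FIELD_COLOR;
	else
		st->current = DEMO_FIELD_UNKNOWN;

	free(fname);				/* the one-line addition */
	st->names_freed++;
}
```

The only cost here is one free per field name, and the parser itself needs no changes.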


The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.
Hmm. This makes me wonder if you've measured how much actual leakage there is?


No I haven't. I have simply theorized about how much memory we might consume if nothing were done by the callers to free the memory.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


