On 5/6/25 01:11, Tom Lane wrote:
> The attached patch is a response to the discussion at [1], where
> it emerged that lots of rows with null join keys can send a hash
> join into too-many-batches hell, if they are on the outer side
> of the join so that they must be null-extended not just discarded.
> This isn't really surprising given that such rows will certainly
> end up in the same hash bucket, and no amount of splitting can
> reduce the size of that bucket.  (I'm a bit surprised that the
> growEnabled heuristic didn't kick in, but it seems it didn't,
> at least not up to several million batches.)
> 

I don't think that's too surprising - growEnabled depends on all tuples
getting into the same batch during a split. But even with many
duplicate values, real-world data sets often have a couple more tuples
that just happen to fall into that bucket too, and during the split
some of those go into one batch and some into the other.
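
For context, the check in ExecHashIncreaseNumBatches looks roughly like
this (simplified from nodeHash.c) - growth gets disabled only if the
split moved either none or all of the in-memory tuples:

    /* simplified from ExecHashIncreaseNumBatches() in nodeHash.c */
    if (nfreed == 0 || nfreed == ninmemory)
    {
        /* the split achieved nothing, give up on growing nbatch */
        hashtable->growEnabled = false;
    }

A single tuple ending up on the other side of the split is enough to
keep growth enabled, which is why the heuristic is so easy to defeat.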

My personal experience is that the growEnabled heuristic is overly
strict, and probably does not trigger very often. It can also get set
to "false" too early, disabling growth prematurely, but that's (much)
harder to hit.

I suggested making growEnabled less strict in [2], i.e. calculating the
threshold as a percentage of the batch, and not disabling growth
permanently. But that was orthogonal to what that thread was about.
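
To illustrate the idea, something like this (a hypothetical sketch -
the thresholds are made up, and it's not what the patch in [2] does
verbatim):

    /*
     * Hypothetical relaxed check: treat the split as failed only if
     * it moved less than 5% or more than 95% of the tuples, and even
     * then just skip this round of growth instead of setting
     * growEnabled = false permanently.
     */
    static bool
    split_was_useful(double nfreed, double ninmemory)
    {
        return (nfreed >= 0.05 * ninmemory &&
                nfreed <= 0.95 * ninmemory);
    }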

But more importantly, wasn't the issue discussed in [1] about parallel
hash joins? I got quite confused while reading the thread ... I'm asking
because growEnabled is checked only in ExecHashIncreaseNumBatches, not
in ExecParallelHashIncreaseNumBatches. So AFAICS parallel hash joins
don't use growEnabled at all, no?

[2]
https://www.postgresql.org/message-id/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f%40vondra.me


> Thinking about that, it occurred to me to wonder why we are putting
> null-keyed tuples into the hash table at all.  They cannot match
> anything, so all we really have to do with them is emit one
> null-extended copy.  Awhile later I had the attached, which shoves
> such rows into a tuplestore that's separate from the hash table
> proper, ensuring that they can't bollix our algorithms for when to
> grow the hash table.  (For tuples coming from the right input, we
> need to use a tuplestore in case we're asked to rescan the existing
> hashtable.  For tuples coming from the left input, we could
> theoretically emit 'em and forget 'em immediately, but that'd require
> some major code restructuring so I decided to just use a tuplestore
> there too.)
> 
> This passes check-world, and I've extended a couple of existing test
> cases to ensure that the new code paths are exercised.  I've not done
> any real performance testing, though.
> 

Are you planning to? If not, I can try to collect some numbers, but I
can't promise that before pgconf.dev.

I'd be surprised if this caused a regression - the hash table lookups
are not exactly free. And even if there was a minor regression, it'd
affect only cases with many NULL keys, and the improved robustness
seems worth it.
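
FWIW my mental model of the approach is roughly this (my paraphrase,
not the actual patch - the variable names are invented):

    /*
     * While loading tuples, divert those with a NULL join key into a
     * tuplestore instead of inserting them into the hash table - they
     * cannot match anything, we only need to null-extend them later.
     */
    if (join_key_is_null)
        tuplestore_puttupleslot(null_tuples, slot);
    else
        ExecHashTableInsert(hashtable, slot, hashvalue);

So the NULL-keyed tuples never participate in the batching decisions
at all, and the extra cost is one tuplestore append per NULL-keyed
tuple.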

BTW do you consider this to be a bugfix for PG18? Or would it have to
wait for PG19 at this point?


regards

-- 
Tomas Vondra


