Hi,

Looking at the main patch (v28).
First off: This looks pretty good. The code's quite readable now (especially
compared to earlier versions), and the comments are good. Really like the
nodeHash split, and the always_inline hackery in nodeHashjoin. I think we're
getting really, really close.

  * ExecHashJoinImpl
  *
- *	This function implements the Hybrid Hashjoin algorithm.
+ *	This function implements the Hybrid Hashjoin algorithm.  By forcing it
+ *	to be always inline many compilers are able to specialize it for
+ *	parallel = true/false without repeating the code.
  *

What about adding the above explanation for the always_inline?

+	/*
+	 * So far we have no idea whether there are any other participants,
+	 * and if so, what phase they are working on.  The only thing we care
+	 * about at this point is whether someone has already created the
+	 * SharedHashJoinBatch objects, the main hash table for batch 0 and
+	 * (if necessary) the skew hash table yet.  One backend will be
+	 * elected to do that now if necessary.
+	 */

The "yet" sounds a bit weird in combination with the "already".

+static void
+ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
+...
+		case PHJ_GROW_BATCHES_ELECTING:
+			/* Elect one participant to prepare the operation. */

That's a good chunk of code. I'm ever so slightly inclined to put it into a
separate function, but I'm not sure it'd look good. Feel entirely free to
disregard.

+static HashJoinTuple
+ExecParallelHashLoadTuple(HashJoinTable hashtable, MinimalTuple tuple,
+						  dsa_pointer *shared)

Not really happy with the name. ExecParallelHashTableInsert() calling
ExecParallelHashLoadTuple() to insert a tuple into the hashtable doesn't
quite strike me as right; the naming similarity to ExecParallelHashTableLoad
seems problematic too. ExecParallelHashAllocTuple() or such? One could argue
it wouldn't be a bad idea to keep a split similar to the one dense_alloc()
and memcpy() have, but I'm not really convinced by that myself. Hm.

+		case PHJ_GROW_BATCHES_ELECTING:
+			/* Elect one participant to prepare the operation. */

Imo that comment could use a one-line summary of what "preparing" means.

+				/*
+				 * We probably also need a smaller bucket array.  How many
+				 * tuples do we expect per batch, assuming we have only
+				 * half of them so far?

Makes sense, but it did cost me a minute of thinking. Maybe add a short
explanation why.

+		case PHJ_GROW_BATCHES_ALLOCATING:
+			/* Wait for the above to be finished. */
+			BarrierArriveAndWait(&pstate->grow_batches_barrier,
+								 WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING);
+			/* Fall through. */

Just to make sure I understand: the "empty" phase is there to protect the
process of the electee doing the sizing calculations etc.? And the reason
that's not possible to do just by waiting for PHJ_GROW_BATCHES_REPARTITIONING
is that somebody could dynamically arrive, i.e. it wouldn't be needed with a
statically sized barrier? Pretty tired here, sorry ;)

+			/* reset temp memory each time to avoid leaks from qual expr */
+			ResetExprContext(econtext);
+
+			if (ExecQual(hjclauses, econtext))

I personally think it's better to avoid this pattern: store the result of
ExecQual() in a variable, call ResetExprContext(), and then act on the
result. There's no need to keep the memory around for longer, and for bigger
contexts you're more likely to still have all the metadata in cache. I'd
previously thought about introducing ExecQualAndReset() or such (rough
sketch below).

  * IDENTIFICATION
  *	  src/backend/executor/nodeHashjoin.c
  *
+ * PARALLELISM
+ *

This is a pretty good explanation. How about adding a reference to it from
nodeHash.c's header?
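To make the ExecQual() point above concrete, the helper I have in mind would
look roughly like this (untested sketch; name and exact placement obviously
up for debate):

static inline bool
ExecQualAndReset(ExprState *state, ExprContext *econtext)
{
	bool		result = ExecQual(state, econtext);

	/* free whatever the qual evaluation leaked into per-tuple memory */
	ResetExprContext(econtext);

	return result;
}

The call site above would then just be
	if (ExecQualAndReset(hjclauses, econtext))
with the separate ResetExprContext() call gone.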
+static TupleTableSlot *			/* return: a tuple or NULL */
+ExecHashJoin(PlanState *pstate)
+{
+	/*
+	 * On sufficiently smart compilers this should be inlined with the
+	 * parallel-aware branches removed.
+	 */
+	return ExecHashJoinImpl(pstate, false);

Ah, the explanation I desired above is here. It still seems good to have a
comment at the somewhat suspicious use of always_inline itself.

+
+	/*
+	 * If we started up so late that the shared batches have been freed
+	 * already by ExecHashTableDetach(), then we are finished.
+	 */
+	if (!DsaPointerIsValid(hashtable->parallel_state->batches))
+		return false;

Is this really the only place that weird condition is detected? Why is it
possible in the first place? And if it is possible, shouldn't we have
detected it earlier? Also, if it is possible, what prevents it from occurring
in a way where the test fails because pstate->batches is freed but not yet
reset?

+		while ((tuple = sts_parallel_scan_next(hashtable->batches[batchno].inner_tuples,
+											   &hashvalue)))

That's a fairly long line. Couldn't this whole block be made more readable by
using a temp variable for inner_tuples? (Sketch in the PS below.)

Ok, eyes are closing against my will. This looks pretty damn close.

- Andres
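PS: For that long sts_parallel_scan_next() line, something like the following
is what I mean (untested sketch; I'm going from memory on the accessor type
name):

	SharedTuplestoreAccessor *inner_tuples;

	/* grab the accessor once, instead of re-indexing it in the loop header */
	inner_tuples = hashtable->batches[batchno].inner_tuples;

	while ((tuple = sts_parallel_scan_next(inner_tuples, &hashvalue)))
	{
		/* existing per-tuple work stays unchanged */
	}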