On 11/26/25 09:15, David Geier wrote:
> On 19.11.2025 21:03, Tomas Vondra wrote:
> 
>> Right, that's why I suggested to have a function the nodes would call in
>> suitable places.
>>
>>>>> I like that idea, even though it would still not work while a node is
>>>>> doing the crunching. That is after it has pulled all rows and before it
>>>>> can return the first row. During this time the node won't call
>>>>> ExecProcNode().
>>>>>
>>>> True. Perhaps we could provide a function nodes could call in suitable
>>>> places to check whether to end?
>>> This function would then also be required by the base relation scans.
>>> And we would have to call it more or less in the same places
>>> CHECK_FOR_INTERRUPTS() is called today.
>>>
>>
>> Yes, but I don't think CHECK_FOR_INTERRUPTS() would be a good place to
>> manipulate the executor state. Maybe you could do some magic with
>> siglongjmp(), but I have "funny" feeling about that - I wouldn't be
>> surprised if that interfered with elog(), which is the only other place
>> using siglongjmp() AFAICS.
> You had the right intuition. siglongjmp-ing out leaves the per-node
> instrumentation state and CurrentMemoryContext in an unexpected state.
> 
> Example: jumping out of the executor, after we've called
> InstrStartNode(), but before we call InstrStopNode(). Subsequently
> calling InstrEndLoop() will give the error you encountered. A similar
> problem exists for CurrentMemoryContext, which is not properly reset.
> 
> I didn't encounter these issues during my testing because they're
> largely timing-dependent. Execution can end before the other workers
> have started executing, in which case the stopping logic never kicks in.
> 
> Both issues can be accounted for when jumping out but this seems
> somewhat hacky.
> 

The question is whether these are the only two such issues possible, and
I'm afraid the answer is "no" :-(

The question is if "exiting" from any place calling CFI leaves the
execution state in a valid state. Valid enough so that we can call
ExecEndNode() on all the nodes, including the one from which we exited.
But I don't think we can rely on that. The node can do multiple steps,
interleaved with CFI, not expecting that only one of them happens. I
assume this would cause a lot of issues ...
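
To illustrate, the instrumented dispatch in execProcnode.c looks roughly
like this (a simplified sketch from memory, not the exact code):

    static TupleTableSlot *
    ExecProcNodeInstr(PlanState *node)
    {
        TupleTableSlot *result;

        InstrStartNode(node->instrument);

        /* a CFI in here that siglongjmp's out never reaches InstrStopNode() */
        result = node->ExecProcNodeReal(node);

        InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

        return result;
    }

Jumping out between InstrStartNode() and InstrStopNode() leaves the
pending start time set, so the later InstrEndLoop() trips over its
"called on running node" check.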

>> Which is why I suggested maybe it should be handled in execProcnode
>> (which would take care of cases where we produce a tuple), and then let
>> nodes to optionally check the flag too (through a new function).
>>
>> I haven't tried doing this, so maybe I'm missing something ...
>>
>>> Beyond that, code such as heap_getnextslot() or index_getnext_slot()
>>> doesn't have access to the PlanState, which would contain the stop
>>> flag. So that would have to be propagated downwards as well.
>>>
>>> All of that would make for a fairly big patch, which the initial patch
>>> avoids.
> This turned out to be false. See below.
> 
>> Right. I don't think we can set the flag in plan/executor state, because
>> that's not available in signal handler / ProcessInterrupts() ... It'd
>> need to be a global variable, I guess.
> What we can do is use a global variable. That also makes checking the
> flag a lot easier because we don't need to pass it around through
> multiple abstraction layers.
> 
> What needs to be taken care of, though, is to only bail out of scans
> that are actually initiated from plan nodes. There are many places in
> the code that use e.g. the table AM API directly, and we don't want to
> bail out of those scans. Without flagging whether a scan should bail
> out, e.g. ScanPgRelation() returns no tuple and the relcache code
> starts failing.
> 
> The new patch accounts for that by introducing a new TableScanDescData
> flag SO_OBEY_PARALLEL_WORKER_STOP, which indicates if the scan should
> adhere to the parallel worker stop or not.
> 
> Stopping is broadcast to all parallel workers via SendProcSignal().
> The stop variable is checked with the new
> CHECK_FOR_PARALLEL_WORKER_STOP() macro.
> 
> In the PoC implementation I've only changed nodeSeqScan.c for now. If
> there's agreement to pursue this approach, I'll change the other places
> as well. Naming can likely still be improved.
> 
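
For the archives, I'm guessing the check boils down to something like
this (a sketch based on your description; I haven't read the patch, so
everything except the two names you mention is assumed):

    /* global flag, set from the ProcSignal handler (exact name assumed) */
    extern volatile sig_atomic_t ParallelWorkerStopPending;

    /* per your description; the exact shape is my guess */
    #define CHECK_FOR_PARALLEL_WORKER_STOP(sscan) \
        (ParallelWorkerStopPending && \
         ((sscan)->rs_flags & SO_OBEY_PARALLEL_WORKER_STOP))

    /* e.g. in heap_getnextslot(), before fetching the next tuple */
    if (CHECK_FOR_PARALLEL_WORKER_STOP(sscan))
    {
        ExecClearTuple(slot);
        return false;       /* behave as if the scan hit its end */
    }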

This assumes we need to "exit" only from a heapam scan. That's true for
the example, but is that enough in general? What if the worker already
finished its plan, and is now busy doing something else expensive?
Could be a big sort, aggregation, ... Can we do something about these
cases too?
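
That's where the "function the nodes would call in suitable places" from
upthread would come in, I think. A rough sketch of what I mean (the
helper is entirely hypothetical, and I'm reusing the flag name assumed
above):

    /* hypothetical helper the executor could expose */
    static inline bool
    ExecParallelWorkerStopRequested(void)
    {
        return ParallelWorkerStopPending;
    }

    /* e.g. in ExecSort(), in the loop that loads the tuplesort */
    for (;;)
    {
        slot = ExecProcNode(outerNode);

        if (TupIsNull(slot) || ExecParallelWorkerStopRequested())
            break;

        tuplesort_puttupleslot(tuplesortstate, slot);
    }

That still wouldn't help with the crunching inside
tuplesort_performsort() itself, though.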


regards

-- 
Tomas Vondra


