Hi, On 2024-02-15 14:42:11 +0530, Robert Haas wrote: > I think the issue is very general. We have lots of subsystems that > both (a) use global variables and (b) contain CHECK_FOR_INTERRUPTS(). > If we process an interrupt while that code is in the middle of > manipulating its global variables and then again re-enter that code, > bad things might happen. We could try to rule that out by analyzing > all such subsystems and all places where CHECK_FOR_INTERRUPTS() may > appear, but that's very difficult. Suppose we took the alternative > approach recommended by Andrey Lepikhov in > http://postgr.es/m/b1b110ae-61f6-4fd9-9b94-f967db9b5...@app.fastmail.com > and instead set a flag that gets handled in some suitable place in the > executor code, like ExecProcNode(). If we did that, then we would know > that we're not in the middle of a syscache lookup, a catcache lookup, > a heavyweight lock acquisition, an ereport, or any other low-level > subsystem call that might create problems for the EXPLAIN code. > > In that design, the hack shown above would go away, and we could be > much more certain that we don't need any other similar hacks for other > subsystems. The only downside is that we might experience a slightly > longer delay before a requested EXPLAIN plan actually shows up, but > that seems like a pretty small price to pay for being able to reason > about the behavior of the system.
I am very wary of adding overhead to ExecProcNode() - I'm quite sure that adding code there would trigger visible overhead for query times. If we went with something like tht approach, I think we'd have to do something like redirecting node->ExecProcNode to a wrapper, presumably from within a CFI. That wrapper could then implement the explain support, without slowing down the normal execution path. > I don't *think* there are any cases where we run in the executor for a > particularly long time without a new call to ExecProcNode(). I guess it depends on what you call a long time. A large sort, for example, could spend a fair amount of time inside tuplesort, similarly, a gather node might need to wait for a worker for a while etc. > It's really hard for me to accept that the heavyweight lock problem > for which the patch contains a workaround is the only one that exists. > I can't see any reason why that should be true. I suspect you're right. Greetings, Andres Freund