On Thu, Jul 18, 2019 at 9:45 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> I think this is going in the wrong direction. Nodes should *always*
> assume that a rescan is possible until ExecEndNode is called.
> If you want to do otherwise, you are going to be inventing a whole
> bunch of complicated and doubtless-initially-buggy control logic
> to pass down information about whether a rescan might be possible.
> That doesn't sound like a recipe for a back-patchable fix. Perhaps
> we could consider redesigning the rules around REWIND in a future
> version, but that's not where to focus the bug fix effort.
So, if I can summarize how we got here, as best I understand it:

0. The historic behavior of the executor is to assume it's OK to leak
resources for the lifetime of the query. Nodes that are executed to
completion generally do some cleanup, but we feel free (as under Limit)
to just stop executing a node without giving it any hint that it should
release resources. So a Sort may hold onto a terabyte of memory and an
index scan may keep holding a pin even after there's no theoretical way
of ever needing those resources again, and we just don't care.

1. Parallel query made that perhaps-already-shaky assumption a lot more
problematic. Partly that's because workers are a scarcer and
considerably heavier resource than anything else, and moreover act as a
container for everything else, so whatever you were leaking before, you
can now leak N times more of it, plus N processes, until the end of the
query. However, there's a correctness reason too, which is that when a
node has a copy in the leader and a copy in every worker, each copy has
its own instrumentation data (startup time, run time, nloops, etc.),
and we can only fold all of that together once the node is done
executing, because it's really hard to add up a bunch of numbers before
the numbers are done changing (there's a toy sketch of that folding
step at the end of this mail). We could've made the instrumentation
shared throughout, but if we had, we could have had contention for
updating the instrumentation data, which seems like it'd be bad.

2. To fix that correctness problem, we decided to try to shut down the
node under a Limit node when we're done with it (commit
85c9d3475e4f680dbca7c04fe096af018f3b8760). At a certain level, this
looks fundamentally necessary to me. If you're going to have N separate
copies of the instrumentation, and you want to add them up when you're
done, then you have to decide to be done at some point; otherwise you
don't know when to add them up, and maybe won't add them up at all, and
then you'll be sad. This does not mean that the exact timing couldn't
be changed somehow, but if you want a correct implementation, you have
to shut down Limit's sub-node after you're done executing it (so that
you get the instrumentation data from the workers after it's final) and
before you start destroying DSM segments and such (so that you get the
instrumentation data from the workers before it vanishes).

3. The aforementioned commit turned out to be buggy in at least two
ways, precisely because it didn't do a good enough job of predicting
when the Limit needed to be shut down. First, there was commit
2cd0acfdade82f3cab362fd9129d453f81cc2745, where we missed the fact that
you could hit the Limit and then back up. Second, there's the present
issue, where the Limit gets rescanned.

So, given all that, if we want to adopt Tom's position that we should
always cater to a possible rescan, then we're going to have to rethink
the way instrumentation data gets consolidated from workers into the
leader, so that we can consolidate multiple times without ending up
with the wrong answer. The other option is to do what I understand Amit
and Thomas to be proposing, which is to do a better job of identifying
the case where we're "done for good" and can trigger the shutdown
fearlessly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
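[Toy sketch mentioned under point 1. This is not PostgreSQL code; it's
a minimal standalone C model with made-up names (ToyInstrumentation,
toy_instr_aggregate) of why the leader's and each worker's per-node
counters can only be folded into one total after every copy of the node
has stopped executing, and only once.]

/*
 * Toy model, not PostgreSQL code: each of NWORKERS "workers" keeps its
 * own counters for one plan node, and the "leader" folds them into a
 * single total exactly once, after every copy of the node is done
 * executing. Folding earlier (while the numbers are still changing) or
 * folding again later (say, after a rescan) would produce wrong totals,
 * which is the correctness problem described in point 1.
 */
#include <stdio.h>

#define NWORKERS 3

typedef struct ToyInstrumentation
{
    long   ntuples;      /* tuples produced by this copy of the node */
    long   nloops;       /* times this copy of the node was executed */
    double total_time;   /* seconds spent in this copy of the node */
} ToyInstrumentation;

/* Accumulate one copy's counters into the running totals. */
static void
toy_instr_aggregate(ToyInstrumentation *dst, const ToyInstrumentation *add)
{
    dst->ntuples += add->ntuples;
    dst->nloops += add->nloops;
    dst->total_time += add->total_time;
}

int
main(void)
{
    /* Pretend these are the final, no-longer-changing per-worker numbers. */
    ToyInstrumentation workers[NWORKERS] = {
        {1000, 1, 0.010},
        {1200, 1, 0.012},
        { 800, 1, 0.009},
    };
    ToyInstrumentation leader = {500, 1, 0.006};

    /*
     * The folding step: it has to happen after the node is done
     * executing (so the counters are final) and before the workers'
     * state goes away (so the counters are still reachable).
     */
    for (int i = 0; i < NWORKERS; i++)
        toy_instr_aggregate(&leader, &workers[i]);

    printf("ntuples=%ld nloops=%ld total_time=%.3fs\n",
           leader.ntuples, leader.nloops, leader.total_time);
    return 0;
}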