On Tue, Sep 20, 2022 at 3:06 PM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Tue, Sep 20, 2022 at 2:35 PM James Coleman <jtc...@gmail.com> wrote:
> > Either I'm missing something (and/or this was fixed in a later PG
> > version), but I don't think this is how it works. We have this
> > specific problem now: we set auto_explain.log_min_duration to 200 (ms)
> > and statement_timeout to 30s, but when a statement times out we do
> > not get the plan logged with auto_explain.
>
> I think you're correct. auto_explain uses the ExecutorEnd hook, but
> that hook will not be fired in the case of an error. Indeed, if an
> error has already been thrown, it would be unsafe to try to
> auto-explain anything. For instance -- and this is just one problem of
> probably many -- ExplainTargetRel() performs catalog lookups, which is
> not OK in a failed transaction.
>
> To make this work, I think you'd need a hook that fires just BEFORE
> the error is thrown. However, previous attempts to introduce hooks
> into ProcessInterrupts() have not met with a warm response from Tom, so
> it might be a tough sell. And maybe for good reason. I see at least
> two problems. The first is that explaining a query is a pretty
> complicated operation that involves catalog lookups and lots of
> complicated stuff, and I don't think that it would be safe to do all
> of that at any arbitrary point in the code where ProcessInterrupts()
> happened to fire. What if one of the syscache lookups misses the cache
> and we have to open the underlying catalog? Then
> AcceptInvalidationMessages() will fire, but we don't currently assume
> that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if
> the running query has called a user-defined function or procedure
> which is running DDL which is in the middle of changing catalog state
> for some relation involved in the query at the moment that the
> statement timeout arrives? I feel like there are big problems here.
>
> The other problem I see is that ProcessInterrupts() is our mechanism
> for allowing people to escape from queries that would otherwise run
> forever by hitting ^C. But what if the explain code goes crazy and
> itself needs to be interrupted, when we're already inside
> ProcessInterrupts()? Maybe that would work out OK... or maybe it
> wouldn't. I'm not really sure. But it's another reason to be very,
> very cautious about putting complicated logic inside
> ProcessInterrupts().
This is exactly the kind of background I was hoping someone would
provide; thank you, Robert.

It seems like one could imagine addressing all of these by having one
of:

- A safe explain (e.g., disallow catalog access) that is potentially
  missing information.
- A safe way to interrupt queries, such as a "safe shutdown" of a node
  (e.g., a seq scan could stop returning tuples early), plus a
  configurable buffer of time after the statement timeout before
  firing a hard abort of the query (and transaction).

Both of those seem like a significant amount of work.

Alternatively I wonder if it's possible (this would maybe assume no
catalog changes in the current transaction -- or at least none that
would be referenced by the current query) to open a new transaction
(with the same horizon information), duplicate the plan over to that
transaction, and run the explain there. That way you do it *after*
the error is raised.

That's some serious spit-balling -- I'm not saying that's doable, just
trying to imagine how one might comprehensively address the concerns.

Does any of that seem at all like a path you could imagine being
fruitful?

Thanks,
James Coleman