On Tue, Sep 20, 2022 at 3:06 PM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Tue, Sep 20, 2022 at 2:35 PM James Coleman <jtc...@gmail.com> wrote:
> > Either I'm missing something (and/or this was fixed in a later PG
> > version), but I don't think this is how it works. We have this
> > specific problem now: we set auto_explain.log_min_duration to 200 (ms)
> > and statement_timeout to 30s, but when a statement times out we do
> > not get the plan logged with auto_explain.
>
> I think you're correct. auto_explain uses the ExecutorEnd hook, but
> that hook will not be fired in the case of an error. Indeed, if an
> error has already been thrown, it would be unsafe to try to
> auto-explain anything. For instance -- and this is just one problem of
> probably many -- ExplainTargetRel() performs catalog lookups, which is
> not OK in a failed transaction.
>
> To make this work, I think you'd need a hook that fires just BEFORE
> the error is thrown. However, previous attempts to introduce hooks
> into ProcessInterrupts() have not met with a warm response from Tom, so
> it might be a tough sell. And maybe for good reason. I see at least
> two problems. The first is that explaining a query is a pretty
> complicated operation that involves catalog lookups and lots of
> complicated stuff, and I don't think that it would be safe to do all
> of that at any arbitrary point in the code where ProcessInterrupts()
> happened to fire. What if one of the syscache lookups misses the cache
> and we have to open the underlying catalog? Then
> AcceptInvalidationMessages() will fire, but we don't currently assume
> that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if
> the running query has called a user-defined function or procedure
> which is running DDL which is in the middle of changing catalog state
> for some relation involved in the query at the moment that the
> statement timeout arrives? I feel like there are big problems here.
>
> The other problem I see is that ProcessInterrupts() is our mechanism
> for allowing people to escape from queries that would otherwise run
> forever by hitting ^C. But what if the explain code goes crazy and
> itself needs to be interrupted, when we're already inside
> ProcessInterrupts()? Maybe that would work out OK... or maybe it
> wouldn't. I'm not really sure. But it's another reason to be very,
> very cautious about putting complicated logic inside
> ProcessInterrupts().
This is exactly the kind of background I was hoping someone would
provide; thank you, Robert.

It seems like one could imagine addressing all of these by having one
of:

- A safe explain (e.g., disallow catalog access) that is potentially
  missing information.
- A safe way to interrupt queries, such as a "safe shutdown" of a node
  (e.g., a seq scan could stop returning tuples early), plus a
  configurable buffer of time after the statement timeout before
  firing a hard abort of the query (and transaction).

Both of those seem like a significant amount of work.

Alternatively I wonder if it's possible (this would maybe assume no
catalog changes in the current transaction -- or at least none that
would be referenced by the current query) to open a new transaction
(with the same horizon information), duplicate the plan over to that
transaction, and run the explain there. That way you do it *after*
the error is raised.

That's some serious spit-balling -- I'm not saying that's doable, just
trying to imagine how one might comprehensively address the concerns.

Does any of that seem at all like a path you could imagine being
fruitful?

Thanks,
James Coleman