On Fri, 6 Mar 2026 at 20:51, Heikki Linnakangas <[email protected]> wrote:
> > In theory we could reduce the window for the race, by having all
> > frontend tools use async connections and have the main thread wait for
> > either the self-pipe or a cancel. That way it would be more similar to
> > the previous signal code in behaviour. That's a much bigger lift though,
> > i.e. all PQexec and PQgetResult calls would need to be modified. My
> > proposed change doesn't require changing the callsites at all.
>
> Yeah, it does have that advantage..
I let Claude Code take a stab at a POC for doing the same thread
approach. So I could get a more accurate feeling of how big this lift
would be. It's a much bigger change, but the general design is
relatively straight forward. It's attached as the nocfbot patch (it's
not built on top of any of the other patches, it's a separate one).
> One simple thing we could is to remember the "generation" in the signal
> handler, and store it in another global variable ("cancelledGeneration"
> or such). In the cancel thread, check that the generation matches;
> otherwise the thread is about to send a cancellation to a query that
> already finished, and should not send it.
>In the cancel thread, check that the generation matches;
otherwise the thread is about to send a cancellation to a query that
already finished, and should not send it.
I took a look at this, and attached a fixup patch that does this. It
uses C11 atomics because locks cannot be taken in the signal handler
and the signal handler needs to read/write variables from/to two
different threads. I'm not sure if it pulls it's own weight though. It
seems a really unlikely scenario where the signal handler is fired
during one generation, but then the cancel thread wakes up during
another. I'm not sure if we should care about it. And I actually think
it could make us miss cancel a SIGINT for non-interactive use cases of
psql.
> I worry how this behaves if establishing the cancel connection gets
> stuck for a long time. Because of a network hiccup, for example. That's
> also not a new problem though; it's perhaps even worse today, if the
> signal handler gets stuck for a long time, trying to establish the
> connection. Still, would be good to do some testing with a bad network.
To make sure we're talking about the same situation, let me summarize
it differently: Establishing the connection for the cancel is slow,
but the actual query connection is still fast.
I think this is an interesting case to consider (much more interesting
than the kind of race the additional generation check could protect
against). First of all because I think it can definitely happen.
Especially with SSL the cancel needs several round trips, while a
query on an already established connection only needs one direction
latency. I can definitely see how this could cause out-of-order
arrival even on stable high-latency networks. Especilaly if there's
also some unfortunate packet drops.
And as you suggested the failure modes are different:
- With master, psql will become unresponsive until the client gets a
response from the server (or tcp timeouts are hit)
- With this patchset, a later query will get cancelled.
I think for interactive psql usage both are annoying, but both are not
the end of the world. I think I would personally prefer the current
master behaviour. I'm not sure preserving it is worth all the
additional code changes though to make all the applications use
non-blocking APIs. In any case SSL for cancel keys is definitely worth
the patchset behaviour to me (even though it sounds slightly worse).
For non-interactive use (i.e. running scripts in psql or other
frontend tools like vacuumdb). I don't think this situation applies.
You want whatever query to be cancelled.