It seems the man page for TCP_USER_TIMEOUT does not align with
reality, then. When I use it on my local machine it effectively acts
as a connection timeout too. The second command times out after
two seconds:

sudo iptables -A INPUT -p tcp --destination-port 5432 -j DROP
psql 'host=localhost tcp_user_timeout=2000'
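To illustrate what that observation amounts to at the socket level, here is a minimal Python sketch (not libpq's actual code; host and port are placeholders). The option is set on the socket before connect(), which is presumably why it also bounds the connection attempt in practice, even though tcp(7) says it only takes effect in synchronized states:

```python
import socket

# A minimal sketch, assuming Linux: setting TCP_USER_TIMEOUT before
# connect() also bounds the connection attempt itself in practice.
# host/port are placeholders, not a real server.
def connect_with_user_timeout(host, port, timeout_ms):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_USER_TIMEOUT (milliseconds): how long transmitted data may
    # stay unacknowledged before the kernel forcibly closes the
    # connection.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)
    s.connect((host, port))
    return s
```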

The keepalive settings only apply once you get to the recv, however. And yes,
it is pretty unlikely for the connection to break right when it is waiting for
data.
But it has happened to us. And when it does it is really bad: since recv is a
blocking call, the process stays blocked forever.
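For comparison, this is a sketch of the Linux keepalive knobs that would eventually unblock such a recv (the timing values here are made up for illustration):

```python
import socket

# Sketch of the Linux-specific keepalive options that make a blocked
# recv() fail eventually instead of hanging forever. Roughly, the
# connection is reset after idle + interval * count seconds of silence.
# The default values below are illustrative, not recommendations.
def enable_keepalive(sock, idle=5, interval=5, count=3):
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # seconds idle before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)       # unanswered probes before reset
```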

When we investigated one of these incidents, it turned out to be a combination
of a few things:
1. The way Citus uses cancellation requests: a Citus query on the coordinator
   creates multiple connections to a worker and uses 2PC for distributed
   transactions. If one connection receives an error, it sends a cancel request
   for all the others.
2. When a machine is under heavy CPU or memory pressure, things don't work
   well:
   i. Errors occur pretty frequently, causing Citus to send lots of cancels.
   ii. The postmaster can be slow to handle new cancellation requests.
   iii. Our failover system can conclude the node is down, because health
      checks are failing.
3. Our failover system effectively cuts the power and the network of the
   primary when it triggers a failover to the secondary.
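
To make point 1 concrete: a cancel request is its own short-lived connection
carrying a single 16-byte packet, so it can get wedged independently of the
query connections it is trying to cancel. A rough Python sketch of the wire
format (hypothetical client code, not Citus's; host, port, pid and key are
placeholders, and the timeout bounds how long the sender can hang):

```python
import socket
import struct

# Per the PostgreSQL frontend/backend protocol, a CancelRequest is one
# 16-byte packet: int32 length (16), int32 cancel code, int32 backend
# pid, int32 secret key. The server replies by simply closing the
# connection. This is an illustrative sketch, not Citus's actual code.
CANCEL_REQUEST_CODE = (1234 << 16) | 5678  # 80877102

def send_cancel_request(host, port, backend_pid, secret_key, timeout=2.0):
    packet = struct.pack("!iiii", 16, CANCEL_REQUEST_CODE,
                         backend_pid, secret_key)
    # A separate, short-lived connection; the timeout keeps the sender
    # from blocking forever if the node dies mid-request.
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(packet)
```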

All of this together can result in a cancel request being interrupted at
exactly the wrong moment. When that happens, a distributed query on the Citus
coordinator becomes blocked forever. We've had queries stuck in this state
for multiple days. The only way out at that point is either restarting
postgres or manually closing the blocked socket (with ss or gdb).

Jelte
