Hi,

I've encountered a situation where HAProxy does not fail over from a server
it has marked as DOWN to a backup server it has marked as UP. I have
managed to reproduce this consistently in a test environment; here is the
(I hope) relevant configuration:

defaults http_defaults_1
  mode http
  hash-type consistent
  hash-balance-factor 150
  maxconn 4096
  http-reuse safe
  option abortonclose
  option allbackups
  timeout check 2s
  timeout connect 15s
  timeout client 30s
  timeout server 30s

backend server-backend from http_defaults_1
  balance roundrobin
  # We keep this at the default (conn-failure) because retrying anything
  # else risks duplicating events on these servers
  retry-on conn-failure
  server server-006 fd8a...0006:8080 maxconn 32 check inter 250 alpn h2
  server server-012 fd8a...0012:8080 maxconn 32 check backup alpn h2
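
(For reference, when I say "the stats socket" below, I'm querying it like
this; the socket path here is just a stand-in for whatever the global
`stats socket` line points at, which I've left out of the snippet above:)

  echo "show servers state server-backend" | socat stdio /run/haproxy/admin.sock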

This is with HAProxy 3.0.3-95a607c running on a VPS with 16GB RAM (we have
seen the same issue on a dedicated server with 64GB though), which is
running Ubuntu 24.04.1 with the default net.ipv4.tcp_retries2 = 15,
net.ipv4.tcp_syn_retries = 6, and tcp_fin_timeout of 60s (these also apply
to IPv6 connections). CPU usage is under 20%.
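
(For completeness, those values are simply read back from sysctl on the box:)

  $ sysctl net.ipv4.tcp_retries2 net.ipv4.tcp_syn_retries net.ipv4.tcp_fin_timeout
  net.ipv4.tcp_retries2 = 15
  net.ipv4.tcp_syn_retries = 6
  net.ipv4.tcp_fin_timeout = 60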

Once I have a small load running (20 req/sec), if I make port 8080 on
server-006 temporarily unavailable by restarting the service on it, HAProxy
logs the transition of server-006 to DOWN (and the stats socket and
server_check_failure metrics show the same) and server-012 picks up
requests as expected, with no 5xx errors recorded.

However, if I instead kill the server-006 machine (so that a TCP health
check to it with `nc` fails with a timeout rather than a connection
refused), the server is marked as DOWN as before, but all requests coming
in to HAProxy for that backend return a 5xx error to the client after 15s
(the timeout connect), and server-012 does not receive any requests despite
showing as UP in the stats socket. This "not failed over" state of 100% 5xx
errors goes on for minutes, sometimes hours, and how long seems to depend
on the load. Reducing the load to a few requests a minute avoids the issue
(and dropping the load while in the "not failed over" state also clears it).
I would have expected the <=32 in-flight requests to have been redispatched
to server-012 as soon as server-006 was marked DOWN, and the remaining
<=4096-32 requests to have been held in the frontend queue until the backend
ones finished, but understandably things get more complicated once timeouts
are involved.
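
(To make the difference between the two failure modes concrete, the manual
probe I mean is roughly the following, using the netcat-openbsd syntax that
ships with Ubuntu and the abbreviated server address from the config above:)

  # service restarted on server-006: the SYN gets answered with a RST
  nc -6 -vz -w 2 fd8a...0006 8080    # fails immediately with "Connection refused"
  # machine killed: the SYN is never answered at all
  nc -6 -vz -w 2 fd8a...0006 8080    # only fails once the -w 2 timeout expires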

The haproxy_backend_current_queue metric, which is normally 0 at that load,
jumps to around 200 during this issue and drops straight back to 0 when it
resolves; there is no draining period.

Looking at `ss -6tnp` during the issue, there are only connections to both
servers in SYN-SENT, and nothing in ESTAB until the issue resolves. I tried
dropping net.ipv4.tcp_retries2 down to 3 and tcp_syn_retries down to 2, but
this made no difference.
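
(Spelling those last two observations out, this is roughly what I ran; the
state and port filters on ss are just there to narrow the output to this
backend:)

  ss -6tnp state syn-sent '( dport = :8080 )'   # only SYN-SENT entries, nothing in ESTAB
  sysctl -w net.ipv4.tcp_retries2=3
  sysctl -w net.ipv4.tcp_syn_retries=2
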
Hoping someone can point me in the right direction to understand what is
happening here.

Thanks,

Miles
