Hi,

I've encountered a situation where HAProxy does not fail over from a server
it has marked as DOWN to a backup server it has marked as UP. I have
managed to reproduce this consistently in a test environment; here is the
(I hope) relevant configuration:

defaults http_defaults_1
  mode http
  hash-type consistent
  hash-balance-factor 150
  maxconn 4096
  http-reuse safe
  option abortonclose
  option allbackups
  timeout check 2s
  timeout connect 15s
  timeout client 30s
  timeout server 30s

backend server-backend from http_defaults_1
  balance roundrobin
  # We keep this at the default (conn-failure) because retrying anything
  # else risks duplicating events on these servers
  retry-on conn-failure
  server server-006 fd8a...0006:8080 maxconn 32 check inter 250 alpn h2
  server server-012 fd8a...0012:8080 maxconn 32 check backup alpn h2
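
(For reference, when I say "the stats socket" below, I'm querying it like
this; the socket path here is just a stand-in for whatever the global
`stats socket` line points at, which I've left out of the snippet above:)

  echo "show servers state server-backend" | socat stdio /run/haproxy/admin.sock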

This is with HAProxy 3.0.3-95a607c running on a VPS with 16GB RAM (we have
seen the same issue on a dedicated server with 64GB though), which is
running Ubuntu 24.04.1 with the default net.ipv4.tcp_retries2 = 15,
net.ipv4.tcp_syn_retries = 6, and tcp_fin_timeout of 60s (these also apply
to IPv6 connections). CPU usage is under 20%.
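
(For completeness, those values are simply read back from sysctl on the box:)

  $ sysctl net.ipv4.tcp_retries2 net.ipv4.tcp_syn_retries net.ipv4.tcp_fin_timeout
  net.ipv4.tcp_retries2 = 15
  net.ipv4.tcp_syn_retries = 6
  net.ipv4.tcp_fin_timeout = 60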

Once I have a small load running (20 req/sec), if I make port 8080 on
server-006 temporarily unavailable by restarting the service on it, HAProxy
logs the transition of server-006 to DOWN (and the stats socket and
server_check_failure metrics show the same) and server-012 picks up
requests as expected, with no 5xx errors recorded.

However, if I instead kill the server-006 machine (so that a TCP health
check to it with `nc` fails with a timeout rather than a connection
refused), the server is marked as DOWN as before, but all requests coming
in to HAProxy for that backend return a 5xx error to the client after 15s
(the timeout connect), and server-012 does not receive any requests despite
showing as UP in the stats socket. This "not failed over" state of 100% 5xx
errors goes on for minutes, sometimes hours, and how long seems to depend
on the load. Reducing the load to a few requests a minute avoids the issue
(and dropping the load while in the "not failed over" state also clears it).
I would have expected the <=32 in-flight requests to have been redispatched
to server-012 as soon as server-006 was marked DOWN, and the remaining
<=4096-32 requests to have been held in the frontend queue until the backend
ones finished, but understandably things get more complicated once timeouts
are involved.
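
(To make the difference between the two failure modes concrete, the manual
probe I mean is roughly the following, using the netcat-openbsd syntax that
ships with Ubuntu and the abbreviated server address from the config above:)

  # service restarted on server-006: the SYN gets answered with a RST
  nc -6 -vz -w 2 fd8a...0006 8080    # fails immediately with "Connection refused"
  # machine killed: the SYN is never answered at all
  nc -6 -vz -w 2 fd8a...0006 8080    # only fails once the -w 2 timeout expires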

The haproxy_backend_current_queue metric, which is normally 0 at that load,
jumps to around 200 during this issue and drops straight back to 0 when it
resolves; there is no draining period.

Looking at `ss -6tnp` during the issue, there are only connections to both
servers in SYN-SENT, and nothing in ESTAB until the issue resolves. I tried
dropping net.ipv4.tcp_retries2 down to 3 and tcp_syn_retries down to 2, but
this made no difference.
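
(Spelling those last two observations out, this is roughly what I ran; the
state and port filters on ss are just there to narrow the output to this
backend:)

  ss -6tnp state syn-sent '( dport = :8080 )'   # only SYN-SENT entries, nothing in ESTAB
  sysctl -w net.ipv4.tcp_retries2=3
  sysctl -w net.ipv4.tcp_syn_retries=2
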
Hoping someone can point me in the right direction to understand what is
happening here.

Thanks,

Miles
