Hi,

HAProxy retries with redispatch are not working consistently.

I've configured haproxy (3.0.5, running as a container on ECS/Fargate)
with retries > 0 to redispatch on errors.

The goals of my haproxy config are (distilled into a snippet right after this list):
  - retry 2 times
  - retry on most retryable errors
  - redispatch on every retry
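
Concretely, each backend carries these lines to implement those goals
(the full config follows below):

|      # total number of retries
|      retries 2
|      # redispatch on every retry
|      option redispatch 1
|      # explicitly list retryable errors so 4XX, 500, 501, etc. are not retried
|      retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504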

Retries with redispatch are working some of the time, but not every
time. It's only retrying with redispatch about 20% of the time.

My config is:

|  global
|      log stdout len 4096 format raw local0
|      maxconn 2000
|      tune.bufsize 262144
|      user haproxy
|      group haproxy
|      stats socket ipv4@127.0.0.1:9999  level admin  expose-fd listeners
|      lua-load /tmp/haproxy_files/static_404.lua
|
|  defaults all
|      log     global
|      mode    http
|      option  httplog
|      timeout client  50s
|
|  defaults backends from all
|      option httpchk
|      http-response del-header server
|
|  resolvers awsdns
|       nameserver dns1 10.10.0.2:53
|       accepted_payload_size 8192 # allow larger DNS payloads
|
|  listen stats from all
|      bind :8080
|      mode http
|      http-request use-service prometheus-exporter if { path /metrics }
|      monitor-uri /haproxy-stats-health
|      stats enable  # Enable stats page
|      stats hide-version  # Hide HAProxy version
|      stats realm Haproxy\ Statistics  # Title text for popup window
|      stats uri /haproxy-stats  # Stats URI
|
|  #############################################################
|  # main endpoint for the services on the ECS cluster
|  frontend mh-serverless from all
|      bind *:8080
|      # simple health check for frontend LB to haproxy
|      monitor-uri /haproxy-health
|
|      option http-server-close
|      option forwardfor
|      http-request add-header x-forwarded-proto https
|      http-request add-header x-forwarded-port 443
|      capture request header x-amzn-trace-id len 256
|      capture request header x-forwarded-for len 256
|      capture request header host len 256
|      capture request header referer len 256
|      capture request header user-agent len 40
|      capture request header x-custom-header len 5
|      # use JSON lines format for easier parsing in cloudwatch and datadog
|      log-format '{"timestamp":"%[date(0,ms),ms_ltime(%Y-%m-%dT%H:%M:%S.%3N%z)]","x-amzn-trace-id":"%[capture.req.hdr(0)]","x-forwarded-for":"%[capture.req.hdr(1)]","http_method":"%HM","uri_path":"%HPO","query_args":"%HQ","http_version":"%HV","http_status_code":"%ST","termination-state":"%ts","latency":%Ta,"response_length":%B,"referer":"%[capture.req.hdr(3)]","backend_name":"%b","host":"meridianlink-prod","service":"haproxy-router","backend":{"name":"%b","concurrent_connections":%bc,"source_ip":"%bi","source_port":"%bp","queue":%bq},"bytes":{"read":%B,"uploaded":%U},"captured_headers":{"request":{"x-forwarded-for":"%[capture.req.hdr(1)]","host":"%[capture.req.hdr(2)]","referer":"%[capture.req.hdr(3)]","user-agent":"%[capture.req.hdr(4)]","x-custom-header":"%[capture.req.hdr(5)]"},"response":"%hs"},"client":{"ip":"%ci","port":"%cp"},"frontend":{"name":"%f","concurrent_connections":%fc,"ip":"%fi","port":"%fp","name_transport":"%ft","log_counter":%lc},"hostname":"%H","http":{"method":"%HM","request_uri_without_query_string":"%HP","request_uri_query_string":"%HQ","request_uri":"%HU","version":"%HV","status_code":%ST,"request":"%r","retries":%rc},"process_concurrent_connections":%ac,"request_counter":%rt,"server":{"name":"%s","concurrent_connections":%sc,"ip":"%si","port":"%sp","queue":%sq},"timers":{"tr":"%tr","Ta":%Ta,"Tc":%Tc,"Td":%Td,"Th":%Th,"Ti":%Ti,"Tq":%Tq,"TR":%TR,"Tr":%Tr,"Tt":%Tt,"Tw":%Tw}}'
|
|      use_backend clientA-app if { path_beg /api/clientA/app }
|      use_backend clientB-app if { path_beg /api/clientB/app }
|      use_backend clientC-app if { path_beg /api/clientC/app }
|      # ..... and so on ...
|
|      default_backend static_404
|
|  #############################################################
|
|  backend clientA-app from backends
|      http-check send meth GET uri /health
|      balance roundrobin
|      # how long to wait for a successful connection
|      timeout connect 3s
|      # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
|      timeout server 7s
|      # how long to wait for a response to health check
|      timeout check 14s
|      # total number of retries
|      # this value is appropriate when max number of tasks will be 3
|      retries 2
|      # redispatch on every retry
|      option redispatch 1
|      # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
|      retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
|      server-template svc 3 clientA-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
|
|  backend clientB-app from backends
|      http-check send meth GET uri /health
|      balance roundrobin
|      # how long to wait for a successful connection
|      timeout connect 3s
|      # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
|      timeout server 7s
|      # how long to wait for a response to health check
|      timeout check 14s
|      # total number of retries
|      # this value is appropriate when max number of tasks will be 3
|      retries 2
|      # redispatch on every retry
|      option redispatch 1
|      # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
|      retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
|      server-template svc 3 clientB-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
|
|  backend clientC-app from backends
|      http-check send meth GET uri /health
|      balance roundrobin
|      # how long to wait for a successful connection
|      timeout connect 3s
|      # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
|      timeout server 7s
|      # how long to wait for a response to health check
|      timeout check 14s
|      # total number of retries
|      # this value is appropriate when max number of tasks will be 3
|      retries 2
|      # redispatch on every retry
|      option redispatch 1
|      # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
|      retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
|      server-template svc 3 clientC-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
|
|  #############################################################
|
|  backend static_404 from all
|      http-request use-service lua.static_404


However, these are the logs that I'm seeing:

|  #timestamp,hostname,server(%b),uri_path(%HPO),status_code(%ST),timers.Ta(%Ta),read(%B),uploaded(%U),retries(%rc),termination-state(%ts)
|  2024-10-13T17:37:00.916,hostA.com,clientA-app.ecs-cluster-prod.local:5000,/api/clientA/app,504,7068,198,46427,0,sH
|  2024-10-13T17:28:19.880,hostA.com,clientB-app.ecs-cluster-prod.local:5000,/api/clientB/app,504,21003,198,50147,+2,sH
|  2024-10-13T17:28:08.873,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7011,198,50245,0,sH
|  2024-10-13T12:30:24.284,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7000,198,47781,0,sH
|  2024-10-13T03:19:21.817,hostA.com,clientB-app.ecs-cluster-prod.local:5000,/api/clientB/app,504,7003,198,41351,0,sH
|  2024-10-13T02:40:15.121,hostA.com,clientA-app.ecs-cluster-prod.local:5000,/api/clientA/app,504,21010,198,53686,+2,sH
|  2024-10-13T02:39:20.349,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7002,198,67650,0,sH

Based on the config above, I'm expecting haproxy to retry on all of
the errors listed in retry-on, at least for request payloads < 256K,
but that's not what I'm seeing.  Notice that sometimes there are
retries, and timers.Ta is consistent with the number of retries (the
+2 rows show Ta around 21000 ms, roughly 3 x the 7s server timeout,
while the 0-retry rows show Ta around 7000 ms), but most of the time
haproxy is not retrying at all.  This is not only happening on 504s;
it's happening on 503s as well.  I'm not sure about the other error
types since those are harder to induce.  I've tried playing with the
tune.bufsize parameter since our request payloads can be quite large
(up to 1MB recently).  It's currently set to 256K, which covers the
95th percentile.  The default is 16K, and with that we were initially
getting almost zero retries; I've also tried setting tune.bufsize to
1MB, which did not change anything either.  I've also played with
various retry-on settings (like all-retryable-errors) with the same
result.
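
For completeness, the bufsize experiments were just variations on the
global section; 262144 is what the config above currently uses, and I
also tried the default and the 1MB value mentioned:

|  global
|      # default is 16384; also tried 1048576 (1MB), neither changed the retry behavior
|      tune.bufsize 262144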

And here's the output of haproxy -vv:

=====================================================
HAProxy version 3.0.5-8e879a5 2024/09/19 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2
2029.
Known bugs: http://www.haproxy.org/bugs/bugs-3.0.5.html
Running on: Linux 5.10.226-214.879.amzn2.x86_64 #1 SMP Tue Sep 24 01:40:52
UTC 2024 x86_64
Build options :
  TARGET  = linux-glibc
  CC      = cc
  CFLAGS  = -O2 -g -fwrapv
  OPTIONS = USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_PROMEX=1
USE_PCRE2=1 USE_PCRE2_JIT=1
  DEBUG   =

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY
+CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE
-LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH
-MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_AWSLC
-OPENSSL_WOLFSSL -OT -PCRE +PCRE2 +PCRE2_JIT -PCRE_JIT +POLL +PRCTL
-PROCCTL +PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT
+SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD
+THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256,
default=2).
Built with OpenSSL version : OpenSSL 3.0.14 4 Jun 2024
Running on OpenSSL version : OpenSSL 3.0.14 4 Jun 2024
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
OpenSSL providers loaded : default
Built with Lua version : Lua 5.4.4
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Built with PCRE2 version : 10.42 2022-12-11
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 12.2.0

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : prometheus-exporter
Available filters :
[BWLIM] bwlim-in
[BWLIM] bwlim-out
[CACHE] cache
[COMP] compression
[FCGI] fcgi-app
[SPOE] spoe
[TRACE] trace
=====================================================

Thanks for your assistance!

- Allen
