Hi,
haproxy retries with redispatch are not working consistently.
I've configured haproxy (3.0.5, running as a container on ECS/Fargate)
with retries > 0 to redispatch on errors.
The goals of my haproxy config are:
- retry 2 times
- retry on most retryable errors
- redispatch on every retry
Retries with redispatch are working some of the time, but not
consistently: haproxy only retries with redispatch about 20% of the time.
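As a sanity check on that figure, the wretr (retries) and wredis (redispatches)
counters from "show stat" can be pulled off the stats socket and compared with
what the logs say. A minimal sketch, assuming the TCP stats socket configured
below (127.0.0.1:9999) is reachable from wherever the script runs, e.g. inside
the haproxy container:

    #!/usr/bin/env python3
    """Dump per-proxy retry/redispatch counters from the HAProxy stats socket."""
    import csv
    import io
    import socket

    def show_stat(host="127.0.0.1", port=9999):
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(b"show stat\n")
            data = b""
            while chunk := sock.recv(4096):
                data += chunk
        # Output is CSV; the header line starts with "# pxname,svname,..."
        return csv.DictReader(io.StringIO(data.decode().lstrip("# ")))

    for row in show_stat():
        if row["svname"] == "FRONTEND":
            continue  # wretr/wredis are only populated for servers and backends
        print(f"{row['pxname']}/{row['svname']}: "
              f"retries={row['wretr']} redispatches={row['wredis']}")
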
My config is:
| global
| log stdout len 4096 format raw local0
| maxconn 2000
| tune.bufsize 262144
| user haproxy
| group haproxy
| stats socket ipv4@127.0.0.1:9999 level admin expose-fd listeners
| lua-load /tmp/haproxy_files/static_404.lua
|
| defaults all
| log global
| mode http
| option httplog
| timeout client 50s
|
| defaults backends from all
| option httpchk
| http-response del-header server
|
| resolvers awsdns
| nameserver dns1 10.10.0.2:53
| accepted_payload_size 8192 # allow larger DNS payloads
|
| listen stats from all
| bind :8080
| mode http
| http-request use-service prometheus-exporter if { path /metrics }
| monitor-uri /haproxy-stats-health
| stats enable # Enable stats page
| stats hide-version # Hide HAProxy version
| stats realm Haproxy\ Statistics # Title text for popup window
| stats uri /haproxy-stats # Stats URI
|
| #############################################################
| # main endpoint for the services on the ECS cluster
| frontend mh-serverless from all
| bind *:8080
| # simple health check for frontend LB to haproxy
| monitor-uri /haproxy-health
|
| option http-server-close
| option forwardfor
| http-request add-header x-forwarded-proto https
| http-request add-header x-forwarded-port 443
| capture request header x-amzn-trace-id len 256
| capture request header x-forwarded-for len 256
| capture request header host len 256
| capture request header referer len 256
| capture request header user-agent len 40
| capture request header x-custom-header len 5
| # use JSON lines format for easier parsing in cloudwatch and datadog
| log-format '{"timestamp":"%[date(0,ms),ms_ltime(%Y-%m-%dT%H:%M:%S.%3N%z)]","x-amzn-trace-id":"%[capture.req.hdr(0)]","x-forwarded-for":"%[capture.req.hdr(1)]","http_method":"%HM","uri_path":"%HPO","query_args":"%HQ","http_version":"%HV","http_status_code":"%ST","termination-state":"%ts","latency":%Ta,"response_length":%B,"referer":"%[capture.req.hdr(3)]","backend_name":"%b","host":"meridianlink-prod","service":"haproxy-router","backend":{"name":"%b","concurrent_connections":%bc,"source_ip":"%bi","source_port":"%bp","queue":%bq},"bytes":{"read":%B,"uploaded":%U},"captured_headers":{"request":{"x-forwarded-for":"%[capture.req.hdr(1)]","host":"%[capture.req.hdr(2)]","referer":"%[capture.req.hdr(3)]","user-agent":"%[capture.req.hdr(4)]","x-custom-header":"%[capture.req.hdr(5)]"},"response":"%hs"},"client":{"ip":"%ci","port":"%cp"},"frontend":{"name":"%f","concurrent_connections":%fc,"ip":"%fi","port":"%fp","name_transport":"%ft","log_counter":%lc},"hostname":"%H","http":{"method":"%HM","request_uri_without_query_string":"%HP","request_uri_query_string":"%HQ","request_uri":"%HU","version":"%HV","status_code":%ST,"request":"%r","retries":%rc},"process_concurrent_connections":%ac,"request_counter":%rt,"server":{"name":"%s","concurrent_connections":%sc,"ip":"%si","port":"%sp","queue":%sq},"timers":{"tr":"%tr","Ta":%Ta,"Tc":%Tc,"Td":%Td,"Th":%Th,"Ti":%Ti,"Tq":%Tq,"TR":%TR,"Tr":%Tr,"Tt":%Tt,"Tw":%Tw}}'
|
| use_backend clientA-app if { path_beg /api/clientA/app }
| use_backend clientB-app if { path_beg /api/clientB/app }
| use_backend clientC-app if { path_beg /api/clientC/app }
| # ..... and so on ...
|
| default_backend static_404
|
| #############################################################
|
| backend clientA-app from backends
| http-check send meth GET uri /health
| balance roundrobin
| # how long to wait for a successful connection
| timeout connect 3s
| # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
| timeout server 7s
| # how long to wait for a response to health check
| timeout check 14s
| # total number of retries
| # this value is appropriate when max number of tasks will be 3
| retries 2
| # redispatch on every retry
| option redispatch 1
| # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
| retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
| server-template svc 3 clientA-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
|
| backend clientB-app from backends
| http-check send meth GET uri /health
| balance roundrobin
| # how long to wait for a successful connection
| timeout connect 3s
| # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
| timeout server 7s
| # how long to wait for a response to health check
| timeout check 14s
| # total number of retries
| # this value is appropriate when max number of tasks will be 3
| retries 2
| # redispatch on every retry
| option redispatch 1
| # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
| retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
| server-template svc 3 clientB-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
|
| backend clientC-app from backends
| http-check send meth GET uri /health
| balance roundrobin
| # how long to wait for a successful connection
| timeout connect 3s
| # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
| timeout server 7s
| # how long to wait for a response to health check
| timeout check 14s
| # total number of retries
| # this value is appropriate when max number of tasks will be 3
| retries 2
| # redispatch on every retry
| option redispatch 1
| # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
| retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
| server-template svc 3 clientC-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
|
| #############################################################
|
| backend static_404 from all
| http-request use-service lua.static_404
However, these are the logs that I'm seeing:
| #timestamp,hostname,server(%b),uri_path(%HPO),status_code(%ST),timers.Ta(%Ta),read(%B),uploaded(%U),retries(%rc),termination-state(%ts)
| 2024-10-13T17:37:00.916,hostA.com,clientA-app.ecs-cluster-prod.local:5000,/api/clientA/app,504,7068,198,46427,0,sH
| 2024-10-13T17:28:19.880,hostA.com,clientB-app.ecs-cluster-prod.local:5000,/api/clientB/app,504,21003,198,50147,+2,sH
| 2024-10-13T17:28:08.873,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7011,198,50245,0,sH
| 2024-10-13T12:30:24.284,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7000,198,47781,0,sH
| 2024-10-13T03:19:21.817,hostA.com,clientB-app.ecs-cluster-prod.local:5000,/api/clientB/app,504,7003,198,41351,0,sH
| 2024-10-13T02:40:15.121,hostA.com,clientA-app.ecs-cluster-prod.local:5000,/api/clientA/app,504,21010,198,53686,+2,sH
| 2024-10-13T02:39:20.349,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7002,198,67650,0,sH
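To tally how often retries actually show up across the full JSON logs (not just
the excerpt above), something like the rough sketch below works; haproxy.log is
a placeholder path for a local export of the log lines, and note that %rc is
rendered with a leading '+' when the request was redispatched, which has to be
normalized before json.loads will accept the line:

    #!/usr/bin/env python3
    """Summarize retried vs non-retried requests per backend from the JSON logs."""
    import json
    import sys
    from collections import Counter

    total, retried = Counter(), Counter()

    with open(sys.argv[1] if len(sys.argv) > 1 else "haproxy.log") as fh:
        for raw in fh:
            line = raw.strip()
            if not line.startswith("{"):
                continue
            # %rc is logged as "+N" on redispatch, which is not valid JSON.
            line = line.replace('"retries":+', '"retries":')
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            backend = entry.get("backend_name", "unknown")
            total[backend] += 1
            if entry.get("http", {}).get("retries", 0) != 0:
                retried[backend] += 1

    for backend in sorted(total):
        print(f"{backend}: {retried[backend]}/{total[backend]} requests retried")
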
Based on the config above, I'm expecting haproxy to retry on all of the
errors listed in retry-on, for all request payloads under 256K, but
that's not what I'm seeing. Notice that sometimes there are retries,
and timers.Ta is consistent with the total number of retries, but most
of the time haproxy is not retrying at all. This isn't limited to 504s;
it happens on 503s as well (I'm not sure about the other error types,
since those are harder to induce). I've tried playing with the
tune.bufsize parameter, since our request payloads can be quite large
(up to 1MB recently). It's currently set to 256K, which covers the
95th percentile. The default is 16K, and with that we were getting
almost zero retries; I've also tried setting tune.bufsize to 1MB, which
didn't change anything either. I've also tried various retry-on
settings (like all-retryable-errors), but I see the same behavior.
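One way to narrow down the payload-size angle is to send one request that fits
comfortably inside tune.bufsize and one that exceeds it, tagged via the
already-captured x-custom-header so the two resulting log lines are easy to
find and compare on %rc/%ts. A minimal sketch (the URL and path below are
placeholders for an endpoint that actually reproduces the 503/504s):

    #!/usr/bin/env python3
    """Send one small and one large POST through haproxy for log comparison."""
    import urllib.error
    import urllib.request

    PROXY_URL = "http://haproxy.internal:8080/api/clientA/app"  # hypothetical

    def post(tag: str, size: int) -> int:
        req = urllib.request.Request(
            PROXY_URL,
            data=b"x" * size,
            headers={"x-custom-header": tag,  # captured with len 5 in the frontend
                     "content-type": "application/octet-stream"},
            method="POST",
        )
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.status
        except urllib.error.HTTPError as exc:
            return exc.code

    # ~100K fits well under tune.bufsize 262144; ~1M does not.
    print("small:", post("small", 100 * 1024))
    print("large:", post("large", 1024 * 1024))

Grepping the JSON logs for the two x-custom-header values should then show
whether the two payload sizes behave differently with respect to %rc.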