On 16/10/2024 at 23:10, Allen Myers wrote:
Hi,


haproxy retries with redispatch are not working consistently.

I've configured haproxy (3.0.5 running as a container on ECS/fargate)
with retries > 0 to redispatch on errors.

The goals of my haproxy config are:
   - retry 2 times
   - retry on most retryable errors
   - redispatch on every retry

Retries with redispatch are working some of the time, but not every
time. It's only retrying with redispatch about 20% of the time.

My config is:

|  global
|      log stdout len 4096 format raw local0
|      maxconn 2000
|      tune.bufsize 262144
|      user haproxy
|      group haproxy
|      stats socket ipv4@127.0.0.1:9999  level admin  expose-fd listeners
|      lua-load /tmp/haproxy_files/static_404.lua
|
|  defaults all
|      log     global
|      mode    http
|      option  httplog
|      timeout client  50s
|
|  defaults backends from all
|      option httpchk
|      http-response del-header server
|
|  resolvers awsdns
|       nameserver dns1 10.10.0.2:53
|       accepted_payload_size 8192 # allow larger DNS payloads
|
|  listen stats from all
|      bind :8080
|      mode http
|      http-request use-service prometheus-exporter if { path /metrics }
|      monitor-uri /haproxy-stats-health
|      stats enable  # Enable stats page
|      stats hide-version  # Hide HAProxy version
|      stats realm Haproxy\ Statistics  # Title text for popup window
|      stats uri /haproxy-stats  # Stats URI
|
|  #############################################################
|  # main endpoint for the services on the ECS cluster
|  frontend mh-serverless from all
|      bind *:8080
|      # simple health check for frontend LB to haproxy
|      monitor-uri /haproxy-health
|
|      option http-server-close
|      option forwardfor
|      http-request add-header x-forwarded-proto https
|      http-request add-header x-forwarded-port 443
|      capture request header x-amzn-trace-id len 256
|      capture request header x-forwarded-for len 256
|      capture request header host len 256
|      capture request header referer len 256
|      capture request header user-agent len 40
|      capture request header x-custom-header len 5
|      # use JSON lines format for easier parsing in cloudwatch and datadog
|      log-format '{"timestamp":"%[date(0,ms),ms_ltime(%Y-%m-%dT%H:%M:%S.%3N%z)]","x-amzn-trace-id":"%[capture.req.hdr(0)]","x-forwarded-for":"%[capture.req.hdr(1)]","http_method":"%HM","uri_path":"%HPO","query_args":"%HQ","http_version":"%HV","http_status_code":"%ST","termination-state":"%ts","latency":%Ta,"response_length":%B,"referer":"%[capture.req.hdr(3)]","backend_name":"%b","host":"meridianlink-prod","service":"haproxy-router","backend":{"name":"%b","concurrent_connections":%bc,"source_ip":"%bi","source_port":"%bp","queue":%bq},"bytes":{"read":%B,"uploaded":%U},"captured_headers":{"request":{"x-forwarded-for":"%[capture.req.hdr(1)]","host":"%[capture.req.hdr(2)]","referer":"%[capture.req.hdr(3)]","user-agent":"%[capture.req.hdr(4)]","x-custom-header":"%[capture.req.hdr(5)]"},"response":"%hs"},"client":{"ip":"%ci","port":"%cp"},"frontend":{"name":"%f","concurrent_connections":%fc,"ip":"%fi","port":"%fp","name_transport":"%ft","log_counter":%lc},"hostname":"%H","http":{"method":"%HM","request_uri_without_query_string":"%HP","request_uri_query_string":"%HQ","request_uri":"%HU","version":"%HV","status_code":%ST,"request":"%r","retries":%rc},"process_concurrent_connections":%ac,"request_counter":%rt,"server":{"name":"%s","concurrent_connections":%sc,"ip":"%si","port":"%sp","queue":%sq},"timers":{"tr":"%tr","Ta":%Ta,"Tc":%Tc,"Td":%Td,"Th":%Th,"Ti":%Ti,"Tq":%Tq,"TR":%TR,"Tr":%Tr,"Tt":%Tt,"Tw":%Tw}}'
|
|      use_backend clientA-app if { path_beg /api/clientA/app }
|      use_backend clientB-app if { path_beg /api/clientB/app }
|      use_backend clientC-app if { path_beg /api/clientC/app }
|      # ..... and so on ...
|
|      default_backend static_404
|
|  #############################################################
|
|  backend clientA-app from backends
|      http-check send meth GET uri /health
|      balance roundrobin
|      # how long to wait for a successful connection
|      timeout connect 3s
|      # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
|      timeout server 7s
|      # how long to wait for a response to health check
|      timeout check 14s
|      # total number of retries
|      # this value is appropriate when max number of tasks will be 3
|      retries 2
|      # redispatch on every retry
|      option redispatch 1
|      # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
|      retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
|      server-template svc 3 clientA-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
|
|  backend clientB-app from backends
|      http-check send meth GET uri /health
|      balance roundrobin
|      # how long to wait for a successful connection
|      timeout connect 3s
|      # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
|      timeout server 7s
|      # how long to wait for a response to health check
|      timeout check 14s
|      # total number of retries
|      # this value is appropriate when max number of tasks will be 3
|      retries 2
|      # redispatch on every retry
|      option redispatch 1
|      # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
|      retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
|      server-template svc 3 clientB-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
|
|  backend clientC-app from backends
|      http-check send meth GET uri /health
|      balance roundrobin
|      # how long to wait for a successful connection
|      timeout connect 3s
|      # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
|      timeout server 7s
|      # how long to wait for a response to health check
|      timeout check 14s
|      # total number of retries
|      # this value is appropriate when max number of tasks will be 3
|      retries 2
|      # redispatch on every retry
|      option redispatch 1
|      # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
|      retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
|      server-template svc 3 clientC-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
|
|  #############################################################
|
|  backend static_404 from all
|      http-request use-service lua.static_404


However, these are the logs that I'm seeing:

|  #timestamp,hostname,server(%b),uri_path(%HPO),status_code(%ST),timers.Ta(%Ta),read(%B),uploaded(%U),retries(%rc),termination-state(%ts)
|  2024-10-13T17:37:00.916,hostA.com,clientA-app.ecs-cluster-prod.local:5000,/api/clientA/app,504,7068,198,46427,0,sH
|  2024-10-13T17:28:19.880,hostA.com,clientB-app.ecs-cluster-prod.local:5000,/api/clientB/app,504,21003,198,50147,+2,sH
|  2024-10-13T17:28:08.873,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7011,198,50245,0,sH
|  2024-10-13T12:30:24.284,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7000,198,47781,0,sH
|  2024-10-13T03:19:21.817,hostA.com,clientB-app.ecs-cluster-prod.local:5000,/api/clientB/app,504,7003,198,41351,0,sH
|  2024-10-13T02:40:15.121,hostA.com,clientA-app.ecs-cluster-prod.local:5000,/api/clientA/app,504,21010,198,53686,+2,sH
|  2024-10-13T02:39:20.349,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7002,198,67650,0,sH

Based on the config above, I'm expecting haproxy to retry on all
of the errors listed in retry-on for all request payloads under
256K, but that's not what I'm seeing.  Notice that sometimes there
are retries, and timers.Ta is consistent with the total number of
retries, but most of the time haproxy is not retrying.  This is
not only happening on 504s; it's happening on 503s as well.  I'm
not sure about the other error types since those are harder to
induce.  I've tried playing with the tune.bufsize parameter since
our request payloads can be quite large (up to 1MB recently).
It's currently set to 256K since that covers the 95th percentile.
The default is 16K, so initially we were getting almost zero
retries, but I've also tried setting tune.bufsize to 1MB, which
did not change anything either.  I've also played with various
retry-on settings (like all-retryable-errors) with the same result.


Hi,

As you noticed, the request must fit in a buffer to perform L7 retries. But it must also be fully received. You should probably add the "option http-buffer-request" directive, or use the "http-request wait-for-body" action if you need more control.
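For example, here is a minimal sketch of one of your backends with request buffering enabled (the wait-for-body timeout below is only an illustrative value, pick whatever fits your traffic):

    backend clientA-app from backends
        # buffer the whole request before forwarding it, so it is still
        # available in the buffer and can be replayed on a retry
        option http-buffer-request
        # or, for finer control over how long to wait for the body:
        # http-request wait-for-body time 5s
        retries 2
        option redispatch 1
        retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
        server-template svc 3 clientA-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none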

I recently noticed this is not mentioned in the "retry-on" documentation, and it is not obvious. I must update the doc. Another solution would be to automatically wait for the request body when L7 retries are enabled.

--
Christopher Faulet
