Hi Christopher,

Thank you for your response!

I'll try the 'wait-for-body' and 'http-buffer-request' options.
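For reference, here's where I'm planning to put them — just my guess at the placement based on the docs, and the `time`/`at-least` values are placeholders I picked to match my 256K tune.bufsize, not recommendations:

```
  defaults backends from all
      option httpchk
      http-response del-header server
      # buffer the whole request body before connecting to a server,
      # so L7 retries can replay it
      option http-buffer-request

  # or, for finer control, per backend:
  backend clientA-app from backends
      # wait up to 10s for the body, or until 262144 bytes are buffered
      http-request wait-for-body time 10s at-least 262144
```

Let me know if that placement looks wrong.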

What exactly do you mean by this?

    "Another solution would be to automatically wait for the request body
if some L7-retries are enabled."

- Allen


On Thu, Oct 17, 2024 at 2:12 AM Christopher Faulet <cfau...@haproxy.com>
wrote:

> Le 16/10/2024 à 23:10, Allen Myers a écrit :
> > Hi,
> >
> >
> > haproxy retries with redispatch are not working consistently.
> >
> > I've configured haproxy (3.0.5 running as a container on ECS/fargate)
> > with retries > 0 to redispatch on errors.
> >
> > The goals of my haproxy config are:
> >    - retry 2 times
> >    - retry on most retryable errors
> >    - redispatch on every retry
> >
> > Retries with redispatch are working some of the time, but not every
> > time. It's only retrying with redispatch about 20% of the time.
> >
> > My config is:
> >
> > |  global
> > |      log stdout len 4096 format raw local0
> > |      maxconn 2000
> > |      tune.bufsize 262144
> > |      user haproxy
> > |      group haproxy
> > |      stats socket ipv4@127.0.0.1:9999 level admin expose-fd listeners
> > |      lua-load /tmp/haproxy_files/static_404.lua
> > |
> > |  defaults all
> > |      log     global
> > |      mode    http
> > |      option  httplog
> > |      timeout client  50s
> > |
> > |  defaults backends from all
> > |      option httpchk
> > |      http-response del-header server
> > |
> > |  resolvers awsdns
> > |       nameserver dns1 10.10.0.2:53
> > |       accepted_payload_size 8192 # allow larger DNS payloads
> > |
> > |  listen stats from all
> > |      bind :8080
> > |      mode http
> > |      http-request use-service prometheus-exporter if { path /metrics }
> > |      monitor-uri /haproxy-stats-health
> > |      stats enable  # Enable stats page
> > |      stats hide-version  # Hide HAProxy version
> > |      stats realm Haproxy\ Statistics  # Title text for popup window
> > |      stats uri /haproxy-stats  # Stats URI
> > |
> > |  #############################################################
> > |  # main endpoint for the services on the ECS cluster
> > |  frontend mh-serverless from all
> > |      bind *:8080
> > |      # simple health check for frontend LB to haproxy
> > |      monitor-uri /haproxy-health
> > |
> > |      option http-server-close
> > |      option forwardfor
> > |      http-request add-header x-forwarded-proto https
> > |      http-request add-header x-forwarded-port 443
> > |      capture request header x-amzn-trace-id len 256
> > |      capture request header x-forwarded-for len 256
> > |      capture request header host len 256
> > |      capture request header referer len 256
> > |      capture request header user-agent len 40
> > |      capture request header x-custom-header len 5
> > |      # use JSON lines format for easier parsing in cloudwatch and datadog
> > |      log-format '{"timestamp":"%[date(0,ms),ms_ltime(%Y-%m-%dT%H:%M:%S.%3N%z)]","x-amzn-trace-id":"%[capture.req.hdr(0)]","x-forwarded-for":"%[capture.req.hdr(1)]","http_method":"%HM","uri_path":"%HPO","query_args":"%HQ","http_version":"%HV","http_status_code":"%ST","termination-state":"%ts","latency":%Ta,"response_length":%B,"referer":"%[capture.req.hdr(3)]","backend_name":"%b","host":"meridianlink-prod","service":"haproxy-router","backend":{"name":"%b","concurrent_connections":%bc,"source_ip":"%bi","source_port":"%bp","queue":%bq},"bytes":{"read":%B,"uploaded":%U},"captured_headers":{"request":{"x-forwarded-for":"%[capture.req.hdr(1)]","host":"%[capture.req.hdr(2)]","referer":"%[capture.req.hdr(3)]","user-agent":"%[capture.req.hdr(4)]","x-custom-header":"%[capture.req.hdr(5)]"},"response":"%hs"},"client":{"ip":"%ci","port":"%cp"},"frontend":{"name":"%f","concurrent_connections":%fc,"ip":"%fi","port":"%fp","name_transport":"%ft","log_counter":%lc},"hostname":"%H","http":{"method":"%HM","request_uri_without_query_string":"%HP","request_uri_query_string":"%HQ","request_uri":"%HU","version":"%HV","status_code":%ST,"request":"%r","retries":%rc},"process_concurrent_connections":%ac,"request_counter":%rt,"server":{"name":"%s","concurrent_connections":%sc,"ip":"%si","port":"%sp","queue":%sq},"timers":{"tr":"%tr","Ta":%Ta,"Tc":%Tc,"Td":%Td,"Th":%Th,"Ti":%Ti,"Tq":%Tq,"TR":%TR,"Tr":%Tr,"Tt":%Tt,"Tw":%Tw}}'
> > |
> > |      use_backend clientA-app if { path_beg /api/clientA/app }
> > |      use_backend clientB-app if { path_beg /api/clientB/app }
> > |      use_backend clientC-app if { path_beg /api/clientC/app }
> > |      # ..... and so on ...
> > |
> > |      default_backend static_404
> > |
> > |  #############################################################
> > |
> > |  backend clientA-app from backends
> > |      http-check send meth GET uri /health
> > |      balance roundrobin
> > |      # how long to wait for a successful connection
> > |      timeout connect 3s
> > |      # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
> > |      timeout server 7s
> > |      # how long to wait for a response to health check
> > |      timeout check 14s
> > |      # total number of retries
> > |      # this value is appropriate when max number of tasks will be 3
> > |      retries 2
> > |      # redispatch on every retry
> > |      option redispatch 1
> > |      # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
> > |      retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
> > |      server-template svc 3 clientA-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
> > |
> > |  backend clientB-app from backends
> > |      http-check send meth GET uri /health
> > |      balance roundrobin
> > |      # how long to wait for a successful connection
> > |      timeout connect 3s
> > |      # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
> > |      timeout server 7s
> > |      # how long to wait for a response to health check
> > |      timeout check 14s
> > |      # total number of retries
> > |      # this value is appropriate when max number of tasks will be 3
> > |      retries 2
> > |      # redispatch on every retry
> > |      option redispatch 1
> > |      # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
> > |      retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
> > |      server-template svc 3 clientB-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
> > |
> > |  backend clientC-app from backends
> > |      http-check send meth GET uri /health
> > |      balance roundrobin
> > |      # how long to wait for a successful connection
> > |      timeout connect 3s
> > |      # how long to wait for a response to an http request (7s is just slightly longer than the typical FIST penalty)
> > |      timeout server 7s
> > |      # how long to wait for a response to health check
> > |      timeout check 14s
> > |      # total number of retries
> > |      # this value is appropriate when max number of tasks will be 3
> > |      retries 2
> > |      # redispatch on every retry
> > |      option redispatch 1
> > |      # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
> > |      retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
> > |      server-template svc 3 clientC-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
> > |
> > |  #############################################################
> > |
> > |  backend static_404 from all
> > |      http-request use-service lua.static_404
> >
> >
> > However, these are the logs that I'm seeing:
> >
> > | #timestamp,hostname,server(%b),uri_path(%HPO),status_code(%ST),timers.Ta(%Ta),read(%B),uploaded(%U),retries(%rc),termination-state(%ts)
> > | 2024-10-13T17:37:00.916,hostA.com,clientA-app.ecs-cluster-prod.local:5000,/api/clientA/app,504,7068,198,46427,0,sH
> > | 2024-10-13T17:28:19.880,hostA.com,clientB-app.ecs-cluster-prod.local:5000,/api/clientB/app,504,21003,198,50147,+2,sH
> > | 2024-10-13T17:28:08.873,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7011,198,50245,0,sH
> > | 2024-10-13T12:30:24.284,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7000,198,47781,0,sH
> > | 2024-10-13T03:19:21.817,hostA.com,clientB-app.ecs-cluster-prod.local:5000,/api/clientB/app,504,7003,198,41351,0,sH
> > | 2024-10-13T02:40:15.121,hostA.com,clientA-app.ecs-cluster-prod.local:5000,/api/clientA/app,504,21010,198,53686,+2,sH
> > | 2024-10-13T02:39:20.349,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7002,198,67650,0,sH
> >
> > Based on the config above, I'm expecting haproxy to retry on all of
> > the errors listed in retry-on for all request payloads < 256K, but
> > that's not what I'm seeing. Notice that sometimes there are retries,
> > and timers.Ta is consistent with the total number of retries, but
> > most of the time haproxy is not retrying. This is not only happening
> > on 504s; it's happening on 503s as well. I'm not sure about other
> > error types since those are harder to induce. I've tried playing with
> > the tune.bufsize parameter since our request payloads can be quite
> > large (recently up to 1MB). It's currently set to 256K, which covers
> > the 95th percentile. The default is 16K, so initially we were getting
> > almost zero retries, but setting tune.bufsize to 1MB did not change
> > anything either. I've also tried various retry-on settings (like
> > all-retryable-errors) with the same result.
> >
>
> Hi,
>
> As you noticed, the request must fit in a buffer to perform L7-retries.
> But it must also be fully received. You should probably add the
> "http-buffer-request" option, or use the "wait-for-body" action if you
> need more control.
>
> I recently noticed this was not mentioned in the "retry-on"
> documentation, and it is not obvious. I must update the doc. Another
> solution would be to automatically wait for the request body if some
> L7-retries are enabled.
>
> --
> Christopher Faulet
>
>
