Hi Christopher, Thank you for your response!
I'll try the 'wait-for-body' and 'http-buffer-request' options.

What exactly do you mean by this? "Another solution would be to automatically wait for the request body if some L7-retries are enabled."

- Allen

On Thu, Oct 17, 2024 at 2:12 AM Christopher Faulet <cfau...@haproxy.com> wrote:
> Le 16/10/2024 à 23:10, Allen Myers a écrit :
> > Hi,
> >
> > haproxy retries with redispatch are not working consistently.
> >
> > I've configured haproxy (3.0.5, running as a container on ECS/fargate)
> > with retries > 0 to redispatch on errors.
> >
> > The goals of my haproxy config are:
> > - retry 2 times
> > - retry on most retryable errors
> > - redispatch on every retry
> >
> > Retries with redispatch are working some of the time, but not every
> > time. It's only retrying with redispatch about 20% of the time.
> >
> > My config is:
> >
> > | global
> > |   log stdout len 4096 format raw local0
> > |   maxconn 2000
> > |   tune.bufsize 262144
> > |   user haproxy
> > |   group haproxy
> > |   stats socket ipv4@127.0.0.1:9999 level admin expose-fd listeners
> > |   lua-load /tmp/haproxy_files/static_404.lua
> > |
> > | defaults all
> > |   log global
> > |   mode http
> > |   option httplog
> > |   timeout client 50s
> > |
> > | defaults backends from all
> > |   option httpchk
> > |   http-response del-header server
> > |
> > | resolvers awsdns
> > |   nameserver dns1 10.10.0.2:53
> > |   accepted_payload_size 8192        # allow larger DNS payloads
> > |
> > | listen stats from all
> > |   bind :8080
> > |   mode http
> > |   http-request use-service prometheus-exporter if { path /metrics }
> > |   monitor-uri /haproxy-stats-health
> > |   stats enable                      # Enable stats page
> > |   stats hide-version                # Hide HAProxy version
> > |   stats realm Haproxy\ Statistics   # Title text for popup window
> > |   stats uri /haproxy-stats          # Stats URI
> > |
> > | #############################################################
> > | # main endpoint for the services on the ECS cluster
> > | frontend mh-serverless from all
> > |   bind *:8080
> > |   # simple health check for frontend LB to haproxy
> > |   monitor-uri /haproxy-health
> > |
> > |   option http-server-close
> > |   option forwardfor
> > |   http-request add-header x-forwarded-proto https
> > |   http-request add-header x-forwarded-port 443
> > |   capture request header x-amzn-trace-id len 256
> > |   capture request header x-forwarded-for len 256
> > |   capture request header host len 256
> > |   capture request header referer len 256
> > |   capture request header user-agent len 40
> > |   capture request header x-custom-header len 5
> > |   # use JSON lines format for easier parsing in cloudwatch and datadog
> > |   log-format '{"timestamp":"%[date(0,ms),ms_ltime(%Y-%m-%dT%H:%M:%S.%3N%z)]","x-amzn-trace-id":"%[capture.req.hdr(0)]","x-forwarded-for":"%[capture.req.hdr(1)]","http_method":"%HM","uri_path":"%HPO","query_args":"%HQ","http_version":"%HV","http_status_code":"%ST","termination-state":"%ts","latency":%Ta,"response_length":%B,"referer":"%[capture.req.hdr(3)]","backend_name":"%b","host":"meridianlink-prod","service":"haproxy-router","backend":{"name":"%b","concurrent_connections":%bc,"source_ip":"%bi","source_port":"%bp","queue":%bq},"bytes":{"read":%B,"uploaded":%U},"captured_headers":{"request":{"x-forwarded-for":"%[capture.req.hdr(1)]","host":"%[capture.req.hdr(2)]","referer":"%[capture.req.hdr(3)]","user-agent":"%[capture.req.hdr(4)]","x-custom-header":"%[capture.req.hdr(5)]"},"response":"%hs"},"client":{"ip":"%ci","port":"%cp"},"frontend":{"name":"%f","concurrent_connections":%fc,"ip":"%fi","port":"%fp","name_transport":"%ft","log_counter":%lc},"hostname":"%H","http":{"method":"%HM","request_uri_without_query_string":"%HP","request_uri_query_string":"%HQ","request_uri":"%HU","version":"%HV","status_code":%ST,"request":"%r","retries":%rc},"process_concurrent_connections":%ac,"request_counter":%rt,"server":{"name":"%s","concurrent_connections":%sc,"ip":"%si","port":"%sp","queue":%sq},"timers":{"tr":"%tr","Ta":%Ta,"Tc":%Tc,"Td":%Td,"Th":%Th,"Ti":%Ti,"Tq":%Tq,"TR":%TR,"Tr":%Tr,"Tt":%Tt,"Tw":%Tw}}'
> > |
> > |   use_backend clientA-app if { path_beg /api/clientA/app }
> > |   use_backend clientB-app if { path_beg /api/clientB/app }
> > |   use_backend clientC-app if { path_beg /api/clientC/app }
> > |   # ..... and so on ...
> > |
> > |   default_backend static_404
> > |
> > | #############################################################
> > |
> > | backend clientA-app from backends
> > |   http-check send meth GET uri /health
> > |   balance roundrobin
> > |   # how long to wait for a successful connection
> > |   timeout connect 3s
> > |   # how long to wait for a response to an http request
> > |   # (7s is just slightly longer than the typical FIST penalty)
> > |   timeout server 7s
> > |   # how long to wait for a response to health check
> > |   timeout check 14s
> > |   # total number of retries
> > |   # this value is appropriate when max number of tasks will be 3
> > |   retries 2
> > |   # redispatch on every retry
> > |   option redispatch 1
> > |   # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
> > |   retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
> > |   server-template svc 3 clientA-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
> > |
> > | backend clientB-app from backends
> > |   http-check send meth GET uri /health
> > |   balance roundrobin
> > |   # how long to wait for a successful connection
> > |   timeout connect 3s
> > |   # how long to wait for a response to an http request
> > |   # (7s is just slightly longer than the typical FIST penalty)
> > |   timeout server 7s
> > |   # how long to wait for a response to health check
> > |   timeout check 14s
> > |   # total number of retries
> > |   # this value is appropriate when max number of tasks will be 3
> > |   retries 2
> > |   # redispatch on every retry
> > |   option redispatch 1
> > |   # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
> > |   retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
> > |   server-template svc 3 clientB-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
> > |
> > | backend clientC-app from backends
> > |   http-check send meth GET uri /health
> > |   balance roundrobin
> > |   # how long to wait for a successful connection
> > |   timeout connect 3s
> > |   # how long to wait for a response to an http request
> > |   # (7s is just slightly longer than the typical FIST penalty)
> > |   timeout server 7s
> > |   # how long to wait for a response to health check
> > |   timeout check 14s
> > |   # total number of retries
> > |   # this value is appropriate when max number of tasks will be 3
> > |   retries 2
> > |   # redispatch on every retry
> > |   option redispatch 1
> > |   # explicitly list types of errors in order to not include 4XX, 500, 501, etc.
> > |   retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
> > |   server-template svc 3 clientC-app.ecs-cluster-prod.local:5000 check inter 15s fall 5 rise 1 resolvers awsdns init-addr none
> > |
> > | #############################################################
> > |
> > | backend static_404 from all
> > |   http-request use-service lua.static_404
> >
> > However, these are the logs that I'm seeing:
> >
> > | #timestamp,hostname,server(%b),uri_path(%HPO),status_code(%ST),timers.Ta(%Ta),read(%B),uploaded(%U),retries(%rc),termination-state(%ts)
> > | 2024-10-13T17:37:00.916,hostA.com,clientA-app.ecs-cluster-prod.local:5000,/api/clientA/app,504,7068,198,46427,0,sH
> > | 2024-10-13T17:28:19.880,hostA.com,clientB-app.ecs-cluster-prod.local:5000,/api/clientB/app,504,21003,198,50147,+2,sH
> > | 2024-10-13T17:28:08.873,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7011,198,50245,0,sH
> > | 2024-10-13T12:30:24.284,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7000,198,47781,0,sH
> > | 2024-10-13T03:19:21.817,hostA.com,clientB-app.ecs-cluster-prod.local:5000,/api/clientB/app,504,7003,198,41351,0,sH
> > | 2024-10-13T02:40:15.121,hostA.com,clientA-app.ecs-cluster-prod.local:5000,/api/clientA/app,504,21010,198,53686,+2,sH
> > | 2024-10-13T02:39:20.349,hostA.com,clientC-app.ecs-cluster-prod.local:5000,/api/clientC/app,504,7002,198,67650,0,sH
> >
> > Based on the config above, I'm expecting haproxy to retry on all
> > of the errors listed in retry-on and for all request payloads <
> > 256K, but that's not what I'm seeing. Notice that sometimes there
> > will be retries, and the timers.Ta is consistent with the total
> > number of retries, but most of the time haproxy is not retrying.
> > This is not only happening on 504s; it's happening on 503s as well.
> > Not sure about other error types since those are a little harder to
> > induce. I've tried playing with the tune.bufsize parameter since
> > our request payloads can be quite large (recently up to 1MB). It's
> > currently set to 256K for now since that covers the 95th percentile.
> > The default is 16K, so initially we were getting almost zero retries,
> > but I've also tried setting tune.bufsize to 1MB, which did not
> > change anything either. I've also played with various retry-on
> > settings (like all-retryable-errors), but saw the same thing.
>
> Hi,
>
> As you noticed, the request must fit in a buffer to perform L7-retries,
> but it must also be fully received. You should probably add the
> "http-buffer-request" option or use the "wait-for-body" action if you
> need more control.
>
> I recently noticed this was not mentioned in the "retry-on" documentation,
> and it is not obvious. I must update the doc. Another solution would be to
> automatically wait for the request body if some L7-retries are enabled.
>
> --
> Christopher Faulet
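Applied to the config above, Christopher's suggestion would look roughly like this. This is only a sketch, not something tested against this setup; the wait-for-body time and size values are arbitrary examples, not taken from the thread:

```
backend clientA-app from backends
    # Buffer the entire request before connecting to a server, so that
    # L7 retries can replay the body (it must also fit in tune.bufsize).
    option http-buffer-request

    # Alternative with finer control: wait for the body explicitly.
    # The 5s limit and 100k threshold are example values only.
    # http-request wait-for-body time 5s at-least 100k

    retries 2
    option redispatch 1
    retry-on 0rtt-rejected response-timeout junk-response empty-response conn-failure 502 503 504
```

The same line would go in each of the clientB-app and clientC-app backends, or once in the shared "backends" defaults section.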