Hi,

We are using H2 only at the frontend. We will collect "show sess all"
if the issue happens again.
Thank you!

Krzysztof Kozłowski

> On 29 Aug 2025, at 10:48, Willy Tarreau <w...@1wt.eu> wrote:
>
> Hi Krzysztof,
>
> On Fri, Aug 29, 2025 at 10:21:07AM +0200, Krzysztof Kozlowski wrote:
>> We observed high CPU utilization on several CPU cores on our HAProxy
>> instances running version 3.2.4 (canary deployment) in production.
>>
>> Following a recent L7 DDoS attack, we noticed high CPU utilization on
>> several CPU cores on these machines. For comparison, other machines in
>> the cluster running HAProxy version 3.0.10 were not impacted by this
>> issue. High CPU utilization was visible even after we stopped sending
>> traffic to that canary machine.
>
> Hmmm that's definitely not good!
>
>> HAProxy canary version:
>> # haproxy -vv
>> HAProxy version 3.2.4-98813a1 2025/08/13 - https://haproxy.org/
> (...)
>> Our 3.2.4 version also contains this patch:
>> https://git.haproxy.org/?p=haproxy.git;a=patch;h=2c7e05f80e3b4bacad84eb213b9db3bc5d8963dc
>
> OK thanks for clarifying, I can confirm that this one is not relevant.
>
>> Not sure if it is important, but we assigned dedicated cores to haproxy
>> with these directives:
>> nbthread 24
>> cpu-map auto:1/1-24 4-27
>
> Useful as well but not involved here.
>
>> Other machines in the cluster which were not impacted by the high CPU
>> utilization issue were running version 3.0.10:
>> # haproxy -vv
>> HAProxy version 3.0.10-346eb4f 2025/04/22 - https://haproxy.org/
> (...)
>> After spotting the high CPU utilization we stopped sending traffic to
>> the canary machine (took it out of the cluster), and then ran "show threads":
>> # echo "show threads" | socat /var/run/haproxy/haproxy1.sock stdio
> (...)
>
> Many thanks for providing these outputs, they are extremely useful!
> See below:
>
>>   Thread 6 : id=0x7f08f32e0640 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
>>              1/6    stuck=0 prof=0 harmless=0 isolated=0
>>              cpu_ns: poll=4546197562468 now=4546197574309 diff=11841
>>              curr_task=0x7f08e5300970 (tasklet) calls=637570029
>                                                        ^^^^^^^^^
>>                fct=0x55fbae711e00(h2_io_cb) ctx=0x7f08e6edfac0
>                                    ^^^^^^^^
>
> This function is the one that handles the I/O events for an HTTP/2
> connection. The fact that it was called half a billion times definitely
> explains why you're seeing huge CPU usage.
>
>>              lock_hist: W:TASK_WQ S:TASK_WQ U:TASK_WQ W:STK_TABLE U:STK_TABLE W:TASK_WQ R:TASK_WQ U:TASK_WQ
>>              call trace(16):
>>              | 0x55fbae834118 <00 00 00 e8 c8 ca dd ff]: ha_dump_backtrace+0x88/0x3e5 > main-0x2910
>>              | 0x55fbae837477 <00 31 f6 e8 49 f6 ff ff]: debug_handler+0xa7/0xca > ha_thread_dump_one
>>              | 0x7f098e718520 <00 00 00 00 0f 1f 40 00]: libc:+0x42520
>>              | 0x7f098eb0f2c1 <48 c7 43 f8 00 00 00 00]: libcrypto:ERR_clear_error+0x41/0xaa
>>              | 0x7f098eee6dac <00 00 00 e8 24 a5 fc ff]: libssl:+0x4bdac > libssl:+0x162d0
>>              | 0x7f098eee717f <f4 89 d5 e8 01 fc ff ff]: libssl:SSL_peek+0x1f/0x95 > libssl:+0x4bd80
>>              | 0x7f098eee7211 <83 ec 10 e8 9f a6 fc ff]: libssl:SSL_read+0x11/0x6f > libssl:+0x168b0
>>              | 0x55fbae622615 <03 76 08 e8 4b f4 fe ff]: main+0xf125 > main-0x1a90
>>              | 0x55fbae6feeaf <4c 89 f2 4c 89 e7 ff 10]: hlua_process_task+0x21e2f
>>              | 0x55fbae712184 <48 89 df e8 1c c9 fe ff]: h2_io_cb+0x384/0x637 > hlua_process_task+0x21a20
>
> Here I'm suspecting that an error is reported by the SSL stack and that
> for whatever reason it is either dropped or ignored (e.g. if we can't
> progress in the processing for whatever reason).
>
>>   Thread 10: id=0x7f08f12dc640 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
>>              1/10   stuck=0 prof=0 harmless=0 isolated=0
>>              cpu_ns: poll=4455933245738 now=4455933256347 diff=10609
>>              curr_task=0x7f098a48d4b0 (tasklet) calls=317426162
>
>>   Thread 13: id=0x7f08efad9640 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
>>              1/13   stuck=0 prof=0 harmless=0 isolated=0
>>              cpu_ns: poll=4744029449011 now=4744029457164 diff=8153
>>              curr_task=0x7f08b4975c60 (tasklet) calls=620312397
>
>>   Thread 15: id=0x7f08eead7640 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
>>              1/15   stuck=0 prof=0 harmless=0 isolated=0
>>              cpu_ns: poll=4369482934362 now=4369482942272 diff=7910
>>              curr_task=0x7f08d7ba8940 (tasklet) calls=621762017
>
>> and then we enabled profiling:
>> # echo "set profiling tasks on" | socat /run/haproxy/haproxy1.sock stdio
>> # echo "show profiling" | socat /run/haproxy/haproxy1.sock stdio
>> Per-task CPU profiling              : on            # set profiling tasks {on|auto|off}
>> Memory usage profiling              : off           # set profiling memory {on|off}
>> Tasks activity over 42.067 sec till 0.000 sec ago:
>>   function             calls      cpu_tot  cpu_avg  lat_tot  lat_avg
>>   ssl_sock_io_cb       143271204  57.69s   402.0ns  56.17s   392.0ns <- conn_subscribe@src/connection.c:959 tasklet_wakeup
>>   h2_io_cb             143271202  56.15s   391.0ns  53.70s   374.0ns <- ssl_sock_io_cb@src/ssl_sock.c:5812 tasklet_wakeup
>
> This is extremely useful as well, we clearly see that something is wrong
> here. We're at this location:
>
>         if ((ctx->conn->flags & CO_FL_ERROR) ||
>             !(ctx->conn->flags & CO_FL_SSL_WAIT_HS)) {
>                 int woke = 0;
>
>                 /* On error, wake any waiter */
>                 if (ctx->subs) {
> -->                     tasklet_wakeup(ctx->subs->tasklet);
>                         ctx->subs->events = 0;
>                         woke = 1;
>                         ctx->subs = NULL;
>                 }
>
> So basically SSL is reporting an error, waking up the H2 layer, which
> consults the transport layer to retrieve data/get updates, and does
> nothing with it. It's unclear to me what we've changed since 3.0 that
> could explain this, but that's a serious problem we need to figure out
> and fix.
>
> If the issue happens again, a "show sess all" would be super helpful
> to understand the internal states that are causing this. However, it
> can disclose some information you don't want to share publicly (e.g.
> IP addresses, internal server names, etc.), so if you can get one, feel
> free to send it privately.
>
> By the way, are you using H2 to reach the servers, or is it only on
> the frontends? I'm trying to figure out which side could be impacted.
>
> Thanks!
> Willy
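
For reference on the point above about the SSL error apparently being reported
but never consumed: below is a minimal, generic sketch of the usual OpenSSL-side
handling, i.e. classifying a failed SSL_read() with SSL_get_error() and draining
the error queue with ERR_get_error(). This is not HAProxy's ssl_sock code; the
function name and the fprintf() reporting are invented purely for illustration.
The only point is that a caller which ignores the classification and leaves the
queue untouched can keep retrying the same failing read indefinitely.

  /* Generic illustration (not HAProxy code): what "consuming" a failed
   * SSL read conventionally looks like with the OpenSSL API. */
  #include <stdio.h>
  #include <openssl/ssl.h>
  #include <openssl/err.h>

  /* Returns >0 on data, 0 if the caller should wait for I/O readiness,
   * -1 on a condition that must stop further read attempts. */
  static int read_and_classify(SSL *ssl, void *buf, int len)
  {
          int ret = SSL_read(ssl, buf, len);

          if (ret > 0)
                  return ret;                 /* got data */

          switch (SSL_get_error(ssl, ret)) {
          case SSL_ERROR_WANT_READ:
          case SSL_ERROR_WANT_WRITE:
                  return 0;                   /* not an error: retry later */
          case SSL_ERROR_ZERO_RETURN:
                  return -1;                  /* clean TLS shutdown */
          case SSL_ERROR_SYSCALL:
          case SSL_ERROR_SSL:
          default: {
                  /* Fatal: drain and report the error queue, and make sure
                   * the caller stops polling/resubscribing this connection. */
                  unsigned long e;
                  char msg[256];

                  while ((e = ERR_get_error()) != 0) {
                          ERR_error_string_n(e, msg, sizeof(msg));
                          fprintf(stderr, "TLS error: %s\n", msg);
                  }
                  return -1;
          }
          }
  }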
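
Similarly, to make the loop shown by the profiling output easier to picture
(ssl_sock_io_cb and h2_io_cb each called ~143 million times in 42 seconds,
each one waking the other), here is a tiny self-contained toy model. Everything
in it is invented: there is no real scheduler and the names only loosely mirror
ssl_sock_io_cb/h2_io_cb. It just shows how a lower layer that keeps re-waking
its subscriber on a sticky error, while the subscriber neither consumes the
error nor stops resubscribing, spins with exploding, near-identical call counts
and zero progress.

  /* Toy model of a "wakeup without progress" loop. NOT HAProxy's code. */
  #include <stdio.h>
  #include <stdbool.h>

  struct toy {
          bool error;              /* sticky transport error, never cleared */
          bool upper_subscribed;   /* upper layer asked to be woken on I/O */
          bool lower_pending;      /* "tasklet" scheduled to run */
          bool upper_pending;
          unsigned long lower_calls, upper_calls;
  };

  /* stand-in for a subscribe call: re-arming the subscription immediately
   * schedules the transport tasklet again because the error is still set */
  static void subscribe(struct toy *t)
  {
          t->upper_subscribed = true;
          t->lower_pending = true;
  }

  static void upper_io_cb(struct toy *t)   /* loosely mirrors h2_io_cb */
  {
          t->upper_calls++;
          /* Consults the transport, obtains no data and does not act on
           * the error, so it simply subscribes again and returns. */
          subscribe(t);
  }

  static void lower_io_cb(struct toy *t)   /* loosely mirrors ssl_sock_io_cb */
  {
          t->lower_calls++;
          if (t->error && t->upper_subscribed) {
                  t->upper_subscribed = false;
                  t->upper_pending = true;  /* tasklet_wakeup() equivalent */
          }
  }

  int main(void)
  {
          struct toy t = { .error = true, .upper_subscribed = true,
                           .lower_pending = true };

          /* trivial "scheduler": run whatever is pending; in reality this
           * never terminates, so cap it for the demo */
          for (int i = 0; i < 10000000 && (t.lower_pending || t.upper_pending); i++) {
                  if (t.lower_pending) { t.lower_pending = false; lower_io_cb(&t); }
                  if (t.upper_pending) { t.upper_pending = false; upper_io_cb(&t); }
          }
          printf("lower=%lu upper=%lu calls, zero bytes of progress\n",
                 t.lower_calls, t.upper_calls);
          return 0;
  }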