Hi,

We are using H2 only on the frontend. We will collect a "show sess all" dump if the
issue happens again.
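
For reference, we would grab it the same way as the earlier dumps, e.g. (assuming
the same stats socket path as before; the output file name is just an example):

  # echo "show sess all" | socat /var/run/haproxy/haproxy1.sock stdio > show-sess-all.txt

and send the result privately, since it may contain client IPs and internal server
names. In the meantime we can also keep task profiling enabled on that canary
("set profiling tasks on" on the socket, or presumably "profiling.tasks on" in the
global section) so the counters are already there next time.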

Thank you!
Krzysztof Kozłowski

> Message written by Willy Tarreau <w...@1wt.eu> on 29 Aug 2025, at 10:48:
> 
> Hi Krzysztof,
> 
> On Fri, Aug 29, 2025 at 10:21:07AM +0200, Krzysztof Kozlowski wrote:
>> We observed high CPU utilization on several CPU cores on our HAProxy
>> instances running version 3.2.4 (canary deployment) in production.
>> 
>> Following a recent L7 DDoS attack, we noticed high CPU utilization on several
>> CPU cores on these machines. For comparison, other machines in the cluster
>> running HAProxy version 3.0.10 were not impacted by this issue. High CPU
>> utilization was visible even after we stopped sending traffic to that canary
>> machine.
> 
> Hmmm that's definitely not good!
> 
>> HAProxy canary version: 
>> # haproxy -vv
>> HAProxy version 3.2.4-98813a1 2025/08/13 - https://haproxy.org/
> (...)
>> Our 3.2.4 version also contains a patch: 
>> https://git.haproxy.org/?p=haproxy.git;a=patch;h=2c7e05f80e3b4bacad84eb213b9db3bc5d8963dc.
> 
> OK thanks for clarifying, I can confirm that this one is not relevant.
> 
>> Not sure if it is important, but we assigned dedicated cores to haproxy with
>> these directives:
>> nbthread 24
>> cpu-map auto:1/1-24 4-27
> 
> Useful as well but not involved here.
> 
>> Other machines in the cluster which were not impacted by the high CPU
>> utilization issue were running version 3.0.10:
>> # haproxy -vv
>> HAProxy version 3.0.10-346eb4f 2025/04/22 - https://haproxy.org/
> (...)
>> After spotting the high CPU utilization we stopped sending traffic to the
>> canary machine (took it out of the cluster), and then we ran "show threads":
>> # echo "show threads" | socat /var/run/haproxy/haproxy1.sock stdio
> (...)
> 
> Many thanks for providing these outputs, they are extremely useful!
> See below:
> 
>>  Thread 6 : id=0x7f08f32e0640 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
>>      1/6    stuck=0 prof=0 harmless=0 isolated=0
>>             cpu_ns: poll=4546197562468 now=4546197574309 diff=11841
>>             curr_task=0x7f08e5300970 (tasklet) calls=637570029
>                                                        ^^^^^^^^^
>>               fct=0x55fbae711e00(h2_io_cb) ctx=0x7f08e6edfac0
>                                    ^^^^^^^^
> 
> This function is the one which handles the I/O events for an HTTP/2
> connection. The fact that it was called half a billion times definitely
> explains why you're seeing such huge CPU usage.
> 
>>             lock_hist: W:TASK_WQ S:TASK_WQ U:TASK_WQ W:STK_TABLE U:STK_TABLE W:TASK_WQ R:TASK_WQ U:TASK_WQ
>>             call trace(16):
>>             | 0x55fbae834118 <00 00 00 e8 c8 ca dd ff]: ha_dump_backtrace+0x88/0x3e5 > main-0x2910
>>             | 0x55fbae837477 <00 31 f6 e8 49 f6 ff ff]: debug_handler+0xa7/0xca > ha_thread_dump_one
>>             | 0x7f098e718520 <00 00 00 00 0f 1f 40 00]: libc:+0x42520
>>             | 0x7f098eb0f2c1 <48 c7 43 f8 00 00 00 00]: libcrypto:ERR_clear_error+0x41/0xaa
>>             | 0x7f098eee6dac <00 00 00 e8 24 a5 fc ff]: libssl:+0x4bdac > libssl:+0x162d0
>>             | 0x7f098eee717f <f4 89 d5 e8 01 fc ff ff]: libssl:SSL_peek+0x1f/0x95 > libssl:+0x4bd80
>>             | 0x7f098eee7211 <83 ec 10 e8 9f a6 fc ff]: libssl:SSL_read+0x11/0x6f > libssl:+0x168b0
>>             | 0x55fbae622615 <03 76 08 e8 4b f4 fe ff]: main+0xf125 > main-0x1a90
>>             | 0x55fbae6feeaf <4c 89 f2 4c 89 e7 ff 10]: hlua_process_task+0x21e2f
>>             | 0x55fbae712184 <48 89 df e8 1c c9 fe ff]: h2_io_cb+0x384/0x637 > hlua_process_task+0x21a20
> 
> Here I'm suspecting that an error is reported by the SSL stack and that
> for some reason it is either dropped, or ignored (e.g. if we can't make
> progress in the processing for whatever reason).
> 
>>  Thread 10: id=0x7f08f12dc640 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
>>      1/10   stuck=0 prof=0 harmless=0 isolated=0
>>             cpu_ns: poll=4455933245738 now=4455933256347 diff=10609
>>             curr_task=0x7f098a48d4b0 (tasklet) calls=317426162
> 
>>  Thread 13: id=0x7f08efad9640 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
>>      1/13   stuck=0 prof=0 harmless=0 isolated=0
>>             cpu_ns: poll=4744029449011 now=4744029457164 diff=8153
>>             curr_task=0x7f08b4975c60 (tasklet) calls=620312397
> 
>>  Thread 15: id=0x7f08eead7640 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
>>      1/15   stuck=0 prof=0 harmless=0 isolated=0
>>             cpu_ns: poll=4369482934362 now=4369482942272 diff=7910
>>             curr_task=0x7f08d7ba8940 (tasklet) calls=621762017
> 
>> and then we enabled profiling:
>> # echo "set profiling tasks on" | socat /run/haproxy/haproxy1.sock stdio
>> # echo "show profiling" | socat /run/haproxy/haproxy1.sock stdio
>> Per-task CPU profiling              : on            # set profiling tasks {on|auto|off}
>> Memory usage profiling              : off           # set profiling memory {on|off}
>> Tasks activity over 42.067 sec till 0.000 sec ago:
>>  function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
>>  ssl_sock_io_cb            143271204   57.69s    402.0ns   56.17s    392.0ns <- conn_subscribe@src/connection.c:959 tasklet_wakeup
>>  h2_io_cb                  143271202   56.15s    391.0ns   53.70s    374.0ns <- ssl_sock_io_cb@src/ssl_sock.c:5812 tasklet_wakeup
> 
> This is extremely useful as well: we clearly see that something is wrong
> here. We're at this location:
> 
>        if ((ctx->conn->flags & CO_FL_ERROR) ||
>            !(ctx->conn->flags & CO_FL_SSL_WAIT_HS)) {
>                int woke = 0;
> 
>                /* On error, wake any waiter */
>                if (ctx->subs) {
> -->                     tasklet_wakeup(ctx->subs->tasklet);
>                        ctx->subs->events = 0;
>                        woke = 1;
>                        ctx->subs = NULL;
>                }
> 
> So basically SSL is reporting an error, waking up the H2 layer, which
> consults the transport layer to retrieve data/get updates, and does
> nothing with it. It's unclear to me what we've changed since 3.0 that
> could explain this, but that's a serious problem we need to figure out
> and fix.
> 
> If the issue happens again, a "show sess all" would be super helpful
> to understand the internal states that are causing this. However, it
> can disclose some information you don't want to share publicly (e.g.
> IP addresses, internal server names, etc.), so if you can get one, feel
> free to send it privately.
> 
> By the way, are you using H2 to reach the servers or is it only on
> the frontends? I'm trying to figure out which side could be impacted.
> 
> Thanks!
> Willy
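
PS: Just to make sure we understand the wakeup loop you describe, we put together
a tiny stand-alone sketch of the pattern (purely illustrative, NOT the actual
HAProxy code; every name and flag below is made up to mimic the idea of
ssl_sock_io_cb and h2_io_cb waking each other while the error is never consumed):

    /* Purely illustrative sketch, NOT HAProxy code: two callbacks that keep
     * re-waking each other when an error flag is set but never consumed.
     * Build: cc -o wakeloop wakeloop.c && ./wakeloop
     */
    #include <stdio.h>

    #define FAKE_CO_FL_ERROR 0x1        /* made-up "connection error" flag */

    struct fake_conn { unsigned flags; };
    struct fake_subs { int events; void (*tasklet)(struct fake_conn *); };

    static unsigned long ssl_calls, h2_calls;
    static struct fake_subs h2_subs;

    /* stand-in for the H2 side: sees the error, cannot make progress,
     * consumes nothing and simply re-subscribes for more events */
    static void fake_h2_io_cb(struct fake_conn *c)
    {
        h2_calls++;
        if (c->flags & FAKE_CO_FL_ERROR)
            h2_subs.events = 1;         /* re-subscribe, error still pending */
    }

    /* stand-in for the SSL side: on error, wakes any subscribed waiter */
    static void fake_ssl_io_cb(struct fake_conn *c)
    {
        ssl_calls++;
        if ((c->flags & FAKE_CO_FL_ERROR) && h2_subs.events) {
            h2_subs.events = 0;
            h2_subs.tasklet(c);         /* "tasklet_wakeup" equivalent */
        }
    }

    int main(void)
    {
        struct fake_conn c = { .flags = FAKE_CO_FL_ERROR };

        h2_subs.tasklet = fake_h2_io_cb;
        h2_subs.events = 1;

        /* the "scheduler": every wakeup immediately produces the next one */
        for (int i = 0; i < 1000000; i++)
            fake_ssl_io_cb(&c);

        printf("ssl calls=%lu  h2 calls=%lu\n", ssl_calls, h2_calls);
        return 0;
    }

It ends up with equal call counts for both callbacks, however long it runs, which
looks a lot like the paired counters we saw in the profiling output. Please tell us
if that mental model is wrong.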


