Interesting, so you have conntrack enabled. With 5M conns, there's no
reason to fill your table. However, have you checked your servers' kernel
logs to see if you find any "conntrack table full" message, that might
be caused by too fast connection recycling ?
We can't see any message mentioning conntrack.
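
For reference, we checked with something along these lines (the sysctl
names are the standard ones; log paths may differ per distro):

    # look for the "table full, dropping packet" message and compare
    # current usage against the limit
    dmesg -T | grep -i 'nf_conntrack.*table full'
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max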

With that said, excessive use of rwlock indicates that openssl is
very likely there, because that's about the only caller in the process.
Indeed, we tried compiling HAProxy 2.6.6 against OpenSSL 1.1.1s, and the
`perf top` results look much better. We can't recall why we decided to
use OpenSSL v3, but it seems like we have to stick to v1.

   2.61%  haproxy           [.] sha512_block_data_order_avx2
   1.89%  haproxy           [.] fe_mul
   1.64%  haproxy           [.] x25519_fe64_sqr
   1.44%  libc.so.6         [.] malloc
   1.39%  [kernel]          [k] __nf_conntrack_find_get
   1.21%  libc.so.6         [.] cfree
   0.95%  [kernel]          [k] native_queued_spin_lock_slowpath.part.0
   0.94%  haproxy           [.] OPENSSL_cleanse
   0.78%  libc.so.6         [.] 0x00000000000a3b46
   0.78%  [kernel]          [k] do_epoll_ctl
   0.75%  [kernel]          [k] ___slab_alloc
   0.73%  [kernel]          [k] psi_group_change
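
For completeness, the build was roughly along these lines (the OpenSSL
prefix and target are just examples, adjust to your layout):

    # build HAProxy 2.6.6 against a locally installed OpenSSL 1.1.1s
    make -j"$(nproc)" TARGET=linux-glibc USE_OPENSSL=1 \
         SSL_INC=/opt/openssl-1.1.1s/include SSL_LIB=/opt/openssl-1.1.1s/lib
    # the matching libssl/libcrypto must also be found at runtime
    # (rpath or LD_LIBRARY_PATH)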

For reference, on a 12-core machine, we ran a test with 420k connections,
32k rps, and a sustained SSL rate of 5k; the load1 was about 9 and the
idle was about 30.

As a side note, we didn't have HTTP/2 enabled; after enabling it, we ran
the same test, and the idle was about 55 and the load1 was about 6.
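
As a quick sanity check that h2 was really being negotiated, something
like this works (the hostname is a placeholder):

    # expect "ALPN protocol: h2" in the output once h2 is enabled
    openssl s_client -connect front.example.com:443 -alpn h2,http/1.1 \
        </dev/null 2>/dev/null | grep -i alpn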

OK. Please give 2.7.x a try (we hope to have 2.7.1 ready soon with a few
fixes, though these should not impact your tests and 2.7.0 will be OK).
And please check whether USE_PTHREAD_EMULATION=1 improves your situation.
If so it will definitely confirm that you're endlessly waiting for some
threads stuck in SSL waiting on a mutex.
We've just tried 2.7.1 with USE_PTHREAD_EMULATION=1 (OpenSSL 3.0.7), and
the results are pretty much the same, with no noticeable improvement.
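
For reference, this is roughly how we built it (the target is an
example, adjust to your platform):

    # HAProxy 2.7.1 with the rwlock emulation, still linked against OpenSSL 3.0.7
    make -j"$(nproc)" TARGET=linux-glibc USE_OPENSSL=1 USE_PTHREAD_EMULATION=1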

Hmmm not good. Check if you have any "-m state --state INVALID -j DROP"
rule in your iptables whose counter increases during the tests. It might
be possible that certain outgoing connections are rejected due to fast
port recycling if they were closed by an RST that was lost before hitting
conntrack. Another possibility here is that the TCP connection was
established but that the SSL handshake didn't complete in time. This
could be caused precisely by the openssl 3 locking issues I was speaking
about, but I guess not at such numbers, and you'd very likely see
the native_queued_spin_lock_slowpath thing in perf top.
We do have a `-A ufw-before-input -m conntrack --ctstate INVALID -j DROP`
rule, but its counter does not increase during the tests. It looks like
OpenSSL v3 is the issue.
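
We watched the rule counters during the runs with something like:

    # the packet/byte counters on the INVALID drop rule should stay flat
    watch -n1 "iptables -L ufw-before-input -v -n | grep -i invalid"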


On 16/12/22 18:11, Willy Tarreau wrote:
On Fri, Dec 16, 2022 at 05:42:50PM +0100, Iago Alonso wrote:
Hi,

Ah that's pretty useful :-) It's very likely dealing with the handshake.
Could you please run "perf top" on this machine and list the 10 top-most
lines ? I'm interested in seeing if you're saturating on crypto functions
or locking functions (e.g. "native_queued_spin_lock_slowpath"), that would
indicate an abuse of mutexes.
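
Something like this is enough (the pid lookup is just one way to do it):

    # attach perf top to the running haproxy process(es)
    perf top -p "$(pgrep -d, haproxy)"
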
We are load testing with k6 (https://k6.io/), on a server with an
AMD Ryzen 5 3600 6-Core Processor (12 threads). We ran a test with
300k connections, 15k rps, and an sslrate and connrate of about 4k.
The server had a `load1` of 12, the `node_nf_conntrack_entries` were about
450k. The `haproxy_server_connect_time_average_seconds` was between
0.5s and 2s, and the `haproxy_backend_connect_time_average_seconds` was
always about 0.2s lower.

We have this custom kernel parameters set:
net.nf_conntrack_max=5000000
fs.nr_open=5000000
net.ipv4.ip_local_port_range=12768    60999
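
For completeness, these are set with sysctl (and typically persisted
under /etc/sysctl.d/), e.g.:

    sysctl -w net.nf_conntrack_max=5000000
    sysctl -w fs.nr_open=5000000
    # this range leaves roughly 48k ephemeral source ports per destination
    sysctl -w net.ipv4.ip_local_port_range="12768 60999"
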
Interesting, so you have conntrack enabled. With 5M conns, there's no
reason to fill your table. However, have you checked your servers' kernel
logs to see if you find any "conntrack table full" message, that might
be caused by too fast connection recycling ?

At that load, we were starting to see a small number of 5xx, and in the
logs we could see that they were caused by connect timeouts.
XXX.XXX.XXX.XXX:15636 https-front~ https-back/redacted.worker
261/0/-1/-1/10260 503 0 - - sC-- 419484/419480/104950/6557/0 0/0 "GET
/ HTTP/1.1" 0/0000000000000000/0/0/0 TLSv1.3/TLS_AES_256_GCM_SHA384
Hmmm not good. Check if you have any "-m state --state INVALID -j DROP"
rule in your iptables whose counter increases during the tests. It might
be possible that certain outgoing connections are rejected due to fast
port recycling if they were closed by an RST that was lost before hitting
conntrack. Another possibility here is that the TCP connection was
established but that the SSL handshake didn't complete in time. This
could be caused precisely by the openssl 3 locking issues I was speaking
about, but I guess not at such numbers, and you'd very likely see
the native_queued_spin_lock_slowpath thing in perf top.

This is the output from `perf top`, after 40M samples, during the test

   17.38%  libc.so.6                   [.] pthread_rwlock_rdlock
   10.25%  libc.so.6                   [.] pthread_rwlock_unlock
With that said, excessive use of rwlock indicates that openssl is
very likely there, because that's about the only caller in the process.
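
To double-check that, a short call-graph capture will usually show
whether those rwlock samples resolve into libcrypto; something like this
(duration and pid lookup are just examples):

    # sample call graphs for 30s, then look at who calls the rwlock
    perf record -g -p "$(pgrep -d, haproxy)" -- sleep 30
    perf report --stdio | grep -B2 -A8 pthread_rwlock_rdlock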

To counter this abuse, we've developed a dirty counter-measure that
consists in providing our own rwlock implementation as a replacement for
the pthread one. That's really not beautiful but we have to deal with
what we have. You can enable it in 2.7 by adding USE_PTHREAD_EMULATION=1
to your "make" command line. These locks are way less expensive but they
will never sleep (no syscall) so instead of idle when the machine is
congested, you'll see pl_wait_unlock_long() eating CPU in "perf top"
output. In our tests that roughly doubles the performance under high
congestion but it's hard to be that specific. In fact it mostly gets
us our lock callbacks back, after they were removed in 1.1.0.

Yes, we are running OpenSSL 3.0.7. We recently upgraded to Ubuntu 20.04,
and we tested all the possible combinations of HAProxy 2.5.3 and 2.6.6
compiled with OpenSSL 1.1.1f and 3.0.7, and we didn't see a noticeable
performance hit running tests similar to the one previously mentioned.
We are planning to try again (next week), just stressing the sslrate,
so we can assess the performance hit a bit more.
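
An easy sanity check for these combinations is to look at which OpenSSL
each binary was built with and is running on, e.g.:

    # prints the "Built with OpenSSL version" / "Running on OpenSSL version" lines
    ./haproxy -vv | grep -i 'openssl version'
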
OK. Please give 2.7.x a try (we hope to have 2.7.1 ready soon with a few
fixes, though these should not impact your tests and 2.7.0 will be OK).
And please check whether USE_PTHREAD_EMULATION=1 improves your situation.
If so it will definitely confirm that you're endlessly waiting for some
threads stuck in SSL waiting on a mutex.

I would really like that we distinguish L4 from L6 in logs output to
indicate whether L4 was established and SSL not or if it's blocking at
L4. The tricky part is that it requires adding new values for the
termination flags, and everyone got used a lot to the current ones...
But I think we'll have to do it anyway, at least to sort out such
issues.

Regards,
Willy
