Hi Miles,

On Thu, Feb 01, 2024 at 05:09:20PM +1100, Miles Hampson wrote:
> Hi,
> 
> We recently hit an issue where we observed the
> haproxy_frontend_current_sessions reported by the prometheus endpoint
> plateau at 4095 and some requests start dropping. Increasing the global and
> listen maxconn from 4096 to something larger (as well as making the kernel
> TCP queues on our Ubuntu 22.04 OS slightly larger) fixed the issue.
> 
> The cause seems to have been a switch from http to https traffic due to a
> client side config change, rather than an increase in the number of
> requests, so I started looking at CPU usage to see if the SSL load was too
> much for our server CPUs. However on one of the modern 24 core machines
> running HAProxy I noticed top was only reporting around 100% CPU usage,
> with both the user and system CPU distributed pretty evenly across all the
> cores (4-8% user per core, 0.5-2% system). The idle percentage was in the
> high nineties, both as reported by top and by the haproxy socket Idle_pct.
> This was just a quick gathering of info and may not be representative,
> since our prometheus node exporter only shows overall CPU (which was a low
> 5% of the total on all cores throughout). This is for a bare metal server
> which is just running a HAProxy processing around 200 SSL req/sec, and not
> doing much else.

So extrapolating from ~5% CPU at 200 req/s, that would give roughly 4000 SSL
req/s max over all cores, or only ~166 per core, which sounds quite low!

> I started wondering if our global settings:
> 
>   master-worker
>   nbthread 24
>   cpu-map auto:1/1-24 0-23
>   tune.ssl.cachesize 100000
> 
> were appropriate or if they had caused some inefficiency in using our
> machine's cores, which then caused this backlog. Or whether what I am
> observing is completely normal, given that we are now spending more time on
> SSL decoding so can expect more queuing (our backend servers are very fast
> and so we run them with a small maxconn, but they don't care if the request
> is SSL or not so the overall request time should be the same other than SSL
> processing time). We are running either the latest OpenSSL 1.1.1 or
> WolfSSL, all compiled sensibly (AES-NI etc).

Normally on a modern x86 CPU core, you should expect roughly 500 RSA2048/s
per core and per GHz (or keep in mind 1000/s for an average 2GHz core).
RSA4096 however is much slower, usually 7 times or so. Example here on
a Core i9-9900K at 5GHz:

  $ openssl speed rsa2048
                    sign    verify    sign/s verify/s
  rsa 2048 bits 0.000404s 0.000012s   2476.7  83266.3
  rsa 4096 bits 0.002726s 0.000042s    366.8  23632.8

On ARM it will vary with the cores, but up to the Neoverse-N1 (e.g.
Graviton2) it was not fantastic, to say the least: around 100/s per GHz
for RSA2048 and 14/s per GHz for RSA4096. The Neoverse-V1 in Graviton3
is way better though, about 2/3 of x86.

> I turned to https://docs.haproxy.org/2.9/management.html#7 which had some
> very interesting advice about pinning haproxy to one CPU core and the
> interrupts to another one, but it also mentioned nbproc and the bind
> process option for better SSL traffic processing. Given that seems to be a
> bit out of date, I thought I might ask my question here instead.

Oops, good catch, I hoped we had gotten rid of all references to nbproc;
we'll definitely have to clean that one up!

> Is there a way to use the CPU cores available on our HAProxy machines to
> handle SSL requests better than I have with the global config above?

I do have a question in return though: what is this CPU exactly? I'm
asking because you mentioned 24 cores and you started 24 threads, but x86
CPUs are usually SMT-capable via HyperThreading or AMD's equivalent, so
you would have twice that number of hardware threads. The gain is not huge,
since both threads of a core share the same compute units; for many
workloads it amounts to about a 10-15% performance increase, and for SSL
it brings almost nothing since the calculation code is already optimized
to use the ALU fully.

What can be interesting with the second thread of each core, though, is
sometimes to let the network stack run on it. But usually when you're
tuning for SSL, the network is not the problem, and vice versa. Let me
explain. It takes just a few
Mbps of network traffic to saturate a machine with SSL handshakes. This
means that if your SSL stack is working like crazy doing computations
to the point of maxing out the CPU, chances are that you're under attack
and that your network load is very low. Conversely, if you're dealing
with a lot of network traffic, it usually means your site is designed
so that you deliver a lot of bytes without forcing clients to perform
handshakes all the time.

So in the end it often makes sense to let both haproxy and the network
stack coexist on all threads of all cores, so that the CPU left unused by
one in a given situation is available to the other (if only to better deal
with attacks or unexpected traffic surges).

OpenSSL doesn't scale well, but 1.1.1 isn't that bad. It reaches a plateau
around 16 threads, which will usually represent 10-20k connections/s for
most machines, and performance doesn't fall off quickly past that point.
So if you have 24 cores and 48 threads, it's fine to just remove the
"nbthread" and "cpu-map" directives from the global section and let haproxy
automatically bind to all CPUs.
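
For example, a minimal sketch of what your global section could look like
after that (keeping your other settings as quoted above; haproxy then picks
the thread count and CPU binding itself):

  global
      master-worker
      # nbthread and cpu-map removed: haproxy starts one thread per
      # logical CPU of the first NUMA node and binds them itself
      tune.ssl.cachesize 100000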

Where there is something to gain is sometimes on the accept() side.
By default when you have one listener bound to many threads, there
is some contention in the system. By starting more listeners you can
reduce or eliminate that contention. The "bind" lines support a
"shards" argument which lets you indicate how many listening sockets
you want to start; one special value is "by-thread", which creates one
socket per thread. But that also lets the kernel pick the thread based
on the incoming connection's source port, and sometimes it's not as
nicely balanced. The default value is "by-group", which means one
listener per thread group. There is a single thread group by default,
but it's possible to configure more; that's necessary to go beyond 64
threads, and it also makes sense on CPUs which have high-latency
independent L3 caches such as 2nd and 3rd gen EPYCs, where you prefer
to have one group per core complex and manually map your threads to
the respective CPUs to limit the sharing between them.
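
As an illustration only (the listener address, certificate path and group
count below are placeholders, not a recommendation for your machine), that
could look like:

  global
      # optional; only worth it on large or multi-CCX CPUs, default is 1
      thread-groups 2

  frontend fe_https
      # one listening socket per thread instead of one per thread group
      bind :443 ssl crt /etc/haproxy/site.pem shards by-thread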

But I don't know what load level you're aiming for. All tuning has
a cost (including in testing and config maintenance), and can be
counter-productive if done randomly. Up to a few thousand conn/s
I generally suggest to just do nothing and let haproxy configure
itself as it sees fit (i.e. one thread per logical CPU on the first
NUMA node, one thread group). My record was around 140k RSA2048/s on
a huge Sapphire Rapids machine with 224 threads (we were lacking
load generators to reach the absolute limit), but it took weeks of
tuning and code adjustments to reach that level. It's not realistic
for production, but it was fine to kill many bottlenecks we had in
the code and even to improve parts of the SSL engine.

> I
> realise this is a bit of an open ended question, but for example I was
> wondering if we could reduce the number of active sessions (so we don't hit
> maxconn) by increasing threads beyond the number of CPU cores, it naively
> seems that might increase per session latency but increase overall
> throughput since we don't appear to be taxing any of the cores (and have
> lots of memory available on these machines).

No, it should not change anything. In fact, while the CPU time spent on a
key computation is large, it's still short compared to a network round trip:
an SSL handshake costs roughly 1ms of CPU time. So if you're seeing too many
connections, it
would mean you're getting more than 1 connection every millisecond on
a given core. Threading can help in this case to spread the load on more
cores to accept such connections but that's not what you're observing.

I suspect that what you're observing instead is just the result of your
site being popular enough and browsers performing pre-connect: based on
the browsing habits of their owner, they'll connect to the site when they
expect the owner is about to visit it, so that the connection is already
available. And you can accumulate lots of idle connections like
this. They'll be killed by the http-request timeout if no request comes
on them, but it can definitely count. We've seen some sites having to
significantly increase their max number of connections when this started
to appear a decade ago.
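
For reference, the timeout in question is "timeout http-request"; a sketch
like the one below (the 10s value is purely illustrative) is what reaps
such idle pre-connections:

  defaults
      # a connection that hasn't sent a complete request within this
      # delay is closed
      timeout http-request 10s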

In any case, if you have some RAM (and I suspect a 24-core machine does),
you should leave large values on the front maxconn so that the kernel
never refrains from accepting connections, because once the kernel's
accept queue is full, the SYN packets are dropped and have to be
retransmitted, which is much slower for the visitor. Very roughly
speaking, an SSL connection during a handshake will take max 100kB of
RAM (it's less but better round it up). This means that you're supposed
to be able to support 10k per GB of RAM assigned to haproxy. This doesn't
include request processing, forwarding to the server, possible handshakes
on the other side, the use of large maps or stick-tables, nor the TCP
stack's buffers. But if you keep in mind 10k conn / GB and you consider
that haproxy will never use more than half of your machine's RAM, you see
that you can be pretty safe with 40k conns on an 8GB machine, 4GB of which
are assigned to haproxy.
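
To put rough numbers on that example (the exact value remains yours to
pick), it would translate to something like:

  global
      # ~100kB per handshaking connection -> ~10k conns per GB of RAM;
      # with ~4GB assigned to haproxy, ~40k is a comfortable ceiling
      maxconn 40000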

> As I said I am not even sure
> there is a problem, but I would like to understand a bit better if there is
> anything we can do to help HAProxy use the CPU cores more effectively,
> since all the advice I can find is obsolete (nbproc etc) and it is quite
> hard to experiment when I don't know what is good to measure.

You're totally right and actually I'm glad you asked, because there are
some power users on this list and maybe some would like to chime in and
share some of their observations, suggestions, even disagree with my
figures above, etc. Tuning is never simple, you need to know what you
want to tune for and in your case that's apparently resource usage and
safety margin, so this involves a lot of parameters, which sometimes
goes down to the choice of hardware vendor (CPU, NICs, etc).

Willy
