Hi Willy,

Thanks for your response, I learned a lot from reading it.

> So that would give roughly 4000 SSL req/sec max over all cores, or only
> 166 per core, that sounds quite low!
>
> Normally on a modern x86 CPU core, you should expect roughly 500 RSA2048/s
> per core and per GHz (or keep in mind 1000/s for an average 2GHz core).
> RSA4096 however is much slower, usually 7 times or so. Example here on
> a core i9-9900K at 5GHz:
>
>   $ openssl speed rsa2048
>                     sign    verify    sign/s verify/s
>   rsa 2048 bits 0.000404s 0.000012s   2476.7  83266.3
>   rsa 4096 bits 0.002726s 0.000042s    366.8  23632.8
>
> On ARM however it will vary with the cores but up to the Neoverse-N1
> (e.g. Graviton2), it was not fantastic, to say the least, around 100/s
> per GHz for RSA2048 and 14/s/GHz for RSA4096. Neoverse-V1 as in Graviton3
> is way better though, about 2/3 of x86.

Interesting, our HAProxy machines are i9-13900 and the rsa2048 results are
very close to yours (2676.2 sign/s, 85263.4 verify/s). That is with the
not-so-great OpenSSL 3 version that ships with our Ubuntu 22.04, rather
than the 1.1.1 that HAProxy is linked against, but in a single-core test
that hopefully shouldn't matter. Almost all of our traffic is actually
TLS_AES_128_GCM_SHA256, but "openssl speed -evp aes-128-gcm" reports
throughput in bytes per second (a few GB/s here), so it is not directly
comparable. With AES-NI enabled, the bulk AES work should be far cheaper
per request than rsa2048 signing, which suggests your point is absolutely
right: this machine should be able to handle at least an order of
magnitude more SSL req/sec/core.
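As a rough sanity check (a sketch with illustrative numbers: ~4 GB/s of
AES-128-GCM per core from AES-NI, and full-size 16 KiB TLS records), bulk
crypto is nowhere near the bottleneck that the handshake is:

```shell
# openssl speed -evp aes-128-gcm reports bytes/s, so convert to
# 16 KiB TLS records per second for a crude comparison against
# the ~2500 rsa2048 sign/s figure above.
BYTES_PER_SEC=4000000000   # illustrative ~4 GB/s per core
RECORD_BYTES=16384         # max TLS record payload
echo $((BYTES_PER_SEC / RECORD_BYTES))
```

That comes out around 244k records/s per core, so even with generous
per-record overhead the RSA handshake remains the dominant cost.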

> At least I'm having a question in return, what is this CPU exactly? I'm
> asking because you mentioned 24 cores and you started 24 threads, but x86
> CPUs usually are SMT-capable via HyperThreading or the equivalent from
> AMD, so you have twice the number of threads. The gain is not much, since
> both threads of a core share the same compute units, for lots of stuff
> it ends in about 10-15% performance increase, and for SSL it brings almost
> zero since the calculation code is already optimized to use the ALU fully.

Good point, the i9-13900 has 8 P cores and 16 E cores, for 32 threads. I
don't know whether the P cores' ALUs differ from the E cores', but it
sounds like there is no point stacking multiple threads on a core if we
can assume roughly one set of compute units per core.

> But what's interesting with the second thread is sometimes to let the
> network run on it. But usually when you're tuning for SSL, the network
> is not the problem, and conversely. Let me explain. It takes just a few
> Mbps of network traffic to saturate a machine with SSL handshakes. This
> means that if your SSL stack is working like crazy doing computations
> to the point of maxing out the CPU, chances are that you're under attack
> and that your network load is very low. Conversely, if you're dealing
> with a lot of network traffic, it usually means your site is designed
> so that you deliver a lot of bytes without forcing clients to perform
> handshakes all the time.
>
> So in the end it often makes sense to let both haproxy and the network
> stack coexist on all threads of all cores, so that the unused CPU in a
> certain situation is available to the other (if only to better deal with
> attacks or unexpected traffic surges).
>
> OpenSSL doesn't scale well, but 1.1.1 isn't that bad. It reaches a plateau
> around 16 threads, which will usually represent 10-20k connections/s for
> most machines, and it doesn't quickly fall past that point.

I am actually also testing HAProxy with wolfSSL ("Built with OpenSSL
version : wolfSSL 5.6.6", compiled with --enable-aesni --enable-intelasm)
and on that same machine type it seems to distribute the same amount of
load slightly more evenly across the cores, though not by much. I should
probably test at a higher load to see whether that holds as we get closer
to the capacity of each core.

> So if you have 24 cores and 48 threads, it's fine to just remove any
> "nbthread" and "cpu-map" directives from the global section and let
> haproxy automatically bind to all CPUs.

The latest configuration manual says of nbthread: "On some platforms
supporting CPU affinity, the default "nbthread" value is automatically
set to the number of CPUs the process is bound to upon startup ...
Otherwise, this value defaults to 1". So I was hoping HAProxy would bind
to all CPUs automatically, but I wasn't 100% sure what the default would
be if I omitted cpu-map. It is good to have confirmation that it will
bind to all CPUs in that case; I will run with that.
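So the relevant part of our global section reduces to something like this
(a sketch of my own config, with unrelated directives omitted):

```
global
    daemon
    # nbthread and cpu-map deliberately omitted: haproxy will start
    # one thread per logical CPU it is bound to at startup
```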

> Where there is something to gain is sometimes on the accept() side.
> By default when you have one listener bound to many threads, there
> is some contention in the system. By starting more listeners you can
> reduce or eliminate that contention. The "bind" lines support a
> "shards" directive which allows you to indicate how many listening
> sockets you want to start, one special value is "by-thread", where
> there will be one socket per thread. But it will also let the kernel
> decide on your thread based on the source port of the incoming
> connection and sometimes it's not as nicely balanced. The default
> value is "by-group", which means one listener per thread-group. There's
> a default of one thread group, but it's possible to configure more;
> that's necessary to go over 64 threads but it makes sense on CPUs
> which have high-latency independent L3 caches such as EPYCs of 2nd
> and 3rd gen, where you prefer to have one group per core complex and
> manually map your threads to the respective CPUs to limit the sharing
> between them.
>
> But I don't know what load level you're aiming for. All tuning has
> a cost (including in testing and config maintenance), and can be
> counter-productive if done randomly. Up to a few thousands conn/s
> I generally suggest to just do nothing and let haproxy configure
> itself as it sees fit (i.e. one thread per logical CPU on the first
> NUMA node, one thread group). My record was around 140k RSA2048/s on
> a huge Sapphire Rapids machine with 224 threads (we were lacking
> load generators to reach the absolute limit), but it took weeks of
> tuning and code adjustments to reach that level. It's not realistic
> for production, but it was fine to kill many bottlenecks we had in
> the code and even to improve parts of the SSL engine.

Great to learn about this, I was wondering if there was a thread
equivalent to the old process "bind" keyword. We actually serve a lot
more http requests than https, since Cloudflare handles some of our https
for us; the HAProxy frontend on each server handles around 1k req/s, of
which about 200/s are SSL. Even though this is not strictly necessary
right now, and it would be quite easy to make things worse, it will be
interesting to experiment with this and watch the effect on CPU usage and
latency. The same goes for "by-group" with multiple thread groups, since
the P cores might be better suited to SSL.
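When I do experiment, my understanding of the syntax is roughly the below
(an untested sketch: the certificate path, the CPU numbering, and the
assumption that the P-core threads are CPUs 0-15 are all mine):

```
global
    nbthread 32
    thread-groups 2
    cpu-map auto:1/1-16 0-15    # hypothetical: P-core threads
    cpu-map auto:2/1-16 16-31   # hypothetical: E-core threads

frontend fe_https
    bind :443 ssl crt /etc/haproxy/certs/ shards by-thread
```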

> No, it should not change anything. In fact while the CPU time spent on a
> key is huge, it's short compared to a network round trip. An SSL key is
> roughly 1ms of CPU time. So if you're seeing too many connections, it
> would mean you're getting more than 1 connection every millisecond on
> a given core. Threading can help in this case to spread the load on more
> cores to accept such connections but that's not what you're observing.
>
> I suspect that what you're observing instead is just the result of your
> site being popular enough and browsers performing pre-connect: based on
> the browsing habits of their owner, they'll connect to the site in case
> they're expected to be about to visit it, so that the connection is
> already available. And you can accumulate lots of idle connections like
> this. They'll be killed by the http-request timeout if no request comes
> on them, but it can definitely count. We've seen some sites having to
> significantly increase their max number of connections when this started
> to appear a decade ago.
>
> In any case, if you have some RAM (and I suspect a 24-core machines does),
> you should leave large values on the front maxconn so that the kernel
> never refrains from accepting connections, because once the kernel's
> accept queue is full, the SYN packets are dropped and have to be
> retransmitted, which is much slower for the visitor. Very roughly
> speaking, an SSL connection during a handshake will take max 100kB of
> RAM (it's less but better round it up). This means that you're supposed
> to be able to support 10k per GB of RAM assigned to haproxy. This doesn't
> include request processing, forwarding to the server, possible handshakes
> on the other side, the use of large maps or stick-tables, nor the TCP
> stack's buffers. But if you keep in mind 10k conn / GB and you consider
> that haproxy will never use more than half of your machine's RAM, you see
> that you can be pretty safe with 40k conns on a 8GB machine, 4 of which
> are assigned to haproxy.

That seems sensible. The vast majority of our sessions are very
short-lived; we just deliver a response with a link to a CDN etc., so the
50th percentile response time from our backend servers is around 3-4ms. I
have set HAProxy up with an 8s connect timeout and 30s client and server
timeouts. Even at 1000 req/s it would be quite easy to fill up a 4096
maxconn (and the default somaxconn/tcp_max_syn_backlog accept queue) with
those timeouts if the backends slowed down.
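A quick back-of-the-envelope with those numbers and your 100 kB per
handshaking connection figure (a sketch; the 1000 req/s rate and 30s
timeout are from our setup, the rest is your estimate):

```shell
REQ_RATE=1000     # req/s per frontend
TIMEOUT_S=30      # client/server timeout
KB_PER_CONN=100   # generous per-connection estimate from above
# concurrent connections if the backends stall completely:
echo $((REQ_RATE * TIMEOUT_S))
# approximate GB of RAM pinned by those connections:
echo $((REQ_RATE * TIMEOUT_S * KB_PER_CONN / 1000000))
```

That prints 30000 and 3, i.e. a stalled backend could pin ~30k
connections and ~3 GB of RAM, well past a 4096 maxconn or the default
listen backlog.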

> You're totally right and actually I'm glad you asked, because there are
> some power users on this list and maybe some would like to chime in and
> share some of their observations, suggestions, even disagree with my
> figures above, etc. Tuning is never simple, you need to know what you
> want to tune for and in your case that's apparently resource usage and
> safety margin, so this involves a lot of parameters, which sometimes
> goes down to the choice of hardware vendor (CPU, NICs, etc).
>
> Willy

Absolutely, any tuning tips or up-to-date HAProxy tuning guides would be
really helpful for me, and perhaps for others reading; I certainly
haven't found any. For brevity I omitted a lot of details of our setup
and hardware (I should mention we are running HAProxy 2.9, though), and
we may just be a weird outlier use case that wouldn't be too useful to
others. But given that we are running on commodity servers with ordinary
NICs and looking to tune resource usage for http and https throughput, I
hope there is some more good information out there to be found.

Thanks again for your time,

Miles
