Hi Willy,

Thanks for your response, I learned a lot from reading it.
> So that would give roughly 4000 SSL req/sec max over all cores, or only
> 166 per core, that sounds quite low!
>
> Normally on a modern x86 CPU core, you should expect roughly 500 RSA2048/s
> per core and per GHz (or keep in mind 1000/s for an average 2GHz core).
> RSA4096 however is much slower, usually 7 times or so. Example here on
> a core i9-9900K at 5GHz:
>
> $ openssl speed rsa2048 rsa4096
>                    sign      verify     sign/s  verify/s
> rsa 2048 bits  0.000404s  0.000012s    2476.7   83266.3
> rsa 4096 bits  0.002726s  0.000042s     366.8   23632.8
>
> On ARM however it will vary with the cores, but up to the Neoverse-N1
> (e.g. Graviton2) it was not fantastic, to say the least: around 100/s
> per GHz for RSA2048 and 14/s/GHz for RSA4096. Neoverse-V1 as in Graviton3
> is way better though, about 2/3 of x86.

Interesting, our HAProxy machines are i9-13900, so the rsa2048 results are
very close to yours (2676.2 sign/s, 85263.4 verify/s). That is with the
not-so-great OpenSSL 3 version that ships with our Ubuntu 22.04, rather
than the 1.1.1 that our HAProxy is built against, but in a single-core test
hopefully that shouldn't matter.

Almost all of our traffic is actually TLS_AES_128_GCM_SHA256, but "openssl
speed -evp aes-128-gcm" reports its results in bytes per second (a few GB/s
here), so it is not directly comparable. I believe our AES operations with
AES-NI enabled should be much cheaper than rsa2048 signatures, which
suggests your point is absolutely right: this machine should be able to
handle at least an order of magnitude more SSL req/sec per core.

> At least I'm having a question in return: what is this CPU exactly? I'm
> asking because you mentioned 24 cores and you started 24 threads, but x86
> CPUs usually are SMT-capable via HyperThreading or the equivalent from
> AMD, so you have twice the number of threads.
> The gain is not much, since both threads of a core share the same compute
> units; for lots of stuff it ends in about 10-15% performance increase, and
> for SSL it brings almost zero since the calculation code is already
> optimized to use the ALU fully.

Good point, the i9-13900 has 8 P cores and 16 E cores, for 32 threads. I
don't know whether the ALUs differ between the P cores and the E cores, but
it sounds like there is no point stacking multiple threads on a core if we
can assume there is roughly one ALU per core.

> But what's interesting with the second thread is sometimes to let the
> network run on it. But usually when you're tuning for SSL, the network
> is not the problem, and conversely. Let me explain. It takes just a few
> Mbps of network traffic to saturate a machine with SSL handshakes. This
> means that if your SSL stack is working like crazy doing computations
> to the point of maxing out the CPU, chances are that you're under attack
> and that your network load is very low. Conversely, if you're dealing
> with a lot of network traffic, it usually means your site is designed
> so that you deliver a lot of bytes without forcing clients to perform
> handshakes all the time.
>
> So in the end it often makes sense to let both haproxy and the network
> stack coexist on all threads of all cores, so that the unused CPU in a
> certain situation is available to the other (if only to better deal with
> attacks or unexpected traffic surges).
>
> OpenSSL doesn't scale well, but 1.1.1 isn't that bad. It reaches a plateau
> around 16 threads, which will usually represent 10-20k connections/s for
> most machines, and it doesn't quickly fall past that point.

I am actually also testing HAProxy with wolfSSL ("Built with OpenSSL
version : wolfSSL 5.6.6", compiled with --enable-aesni --enable-intelasm),
and on that same machine type it seems to distribute the same amount of
load slightly more evenly across the cores, but not by much.
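As an aside, one quick way to see how close to linear the raw crypto scales
on a given box is to compare a single-process run against several forked
workers. This only measures the library's crypto code, not the shared-state
contention inside a multithreaded haproxy, and the worker count of 8 is
just an example:

```shell
# Raw RSA2048 throughput, single process vs. 8 forked workers.
# "-multi" forks independent workers, so this shows crypto scaling
# only, without any lock contention between threads.
openssl speed -seconds 1 rsa2048
openssl speed -seconds 1 -multi 8 rsa2048
```

If the 8-worker sign/s figure is close to 8x the single run, any plateau
seen with haproxy under load is more likely in the SSL library's shared
state or the accept path than in the crypto itself.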
I probably should test with a higher load to see whether that continues to
be the case as we get closer to the capacity of each core.

> So if you have 24 cores and 48 threads, it's fine to just remove any
> "nbthread" and "cpu-map" directives from the global section and let
> haproxy automatically bind to all CPUs.

The latest configuration manual says for nbthread: "On some platforms
supporting CPU affinity, the default "nbthread" value is automatically set
to the number of CPUs the process is bound to upon startup ... Otherwise,
this value defaults to 1". So I was hoping that HAProxy would automatically
bind to all CPUs, but I wasn't 100% sure what the default would be if I
omitted cpu-map. It is good to have confirmation that it will bind to all
CPUs in that case; I will run with that.

> Where there is something to gain is sometimes on the accept() side.
> By default when you have one listener bound to many threads, there
> is some contention in the system. By starting more listeners you can
> reduce or eliminate that contention. The "bind" lines support a
> "shards" directive which allows you to indicate how many listening
> sockets you want to start; one special value is "by-thread", where
> there will be one socket per thread. But it will also let the kernel
> decide on your thread based on the source port of the incoming
> connection, and sometimes it's not as nicely balanced. The default
> value is "by-group", which means one listener per thread-group. There's
> a default of one thread group, but it's possible to configure more;
> that's necessary to go over 64 threads, but it also makes sense on CPUs
> which have high-latency independent L3 caches, such as EPYCs of 2nd
> and 3rd gen, where you prefer to have one group per core complex and
> manually map your threads to the respective CPUs to limit the sharing
> between them.
>
> But I don't know what load level you're aiming for.
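(For anyone else following along, here is a sketch of what this could look
like on our i9-13900: one thread group on the P-core SMT threads, one on
the E cores. The frontend name, certificate path, and the assumption that
logical CPUs 0-15 are the P-core threads and 16-31 the E cores are for
illustration only; this is untested and the CPU numbering should be
verified with lscpu first.)

```
global
    # one group for the 8 P cores (16 SMT threads), one for the
    # 16 E cores, mapped onto the matching logical CPUs
    nbthread 32
    thread-groups 2
    thread-group 1 1-16
    thread-group 2 17-32
    cpu-map auto:1/1-16 0-15
    cpu-map auto:2/1-16 16-31

frontend fe_https
    # one listening socket per thread to reduce accept() contention
    bind :443 ssl crt /etc/haproxy/site.pem shards by-thread
    default_backend be_app
```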
> All tuning has a cost (including in testing and config maintenance), and
> can be counter-productive if done randomly. Up to a few thousand conn/s
> I generally suggest to just do nothing and let haproxy configure itself
> as it sees fit (i.e. one thread per logical CPU on the first NUMA node,
> one thread group). My record was around 140k RSA2048/s on a huge Sapphire
> Rapids machine with 224 threads (we were lacking load generators to reach
> the absolute limit), but it took weeks of tuning and code adjustments to
> reach that level. It's not realistic for production, but it was fine to
> kill many bottlenecks we had in the code and even to improve parts of
> the SSL engine.

Great to learn about this, I was wondering if there was a thread equivalent
of the old "bind-process" keyword. We actually serve far more plain http
requests than https, since we get Cloudflare to handle some of our https
for us: the HAProxy frontend on each server handles around 1k req/s, of
which about 200/s are SSL. It will be interesting to experiment with this
and see the effects on CPU usage and latency, even if it is not strictly
necessary right now and it would be quite easy to make things worse. The
by-group and thread-group options are also interesting because the P cores
might be better suited to SSL.

> No, it should not change anything. In fact while the CPU time spent on a
> key is huge, it's short compared to a network round trip. An SSL key is
> roughly 1ms of CPU time. So if you're seeing too many connections, it
> would mean you're getting more than 1 connection every millisecond on
> a given core. Threading can help in this case to spread the load on more
> cores to accept such connections but that's not what you're observing.
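(A quick back-of-the-envelope check on that 1 ms figure, assuming it
applies to our RSA2048 handshakes and to all 24 cores equally:)

```shell
# If one RSA2048 handshake costs ~1 ms of CPU (Willy's round figure),
# a single core can absorb ~1000 handshakes/s, and 24 cores ~24000/s.
awk 'BEGIN { per_core = 1 / 0.001; printf "%d %d\n", per_core, per_core * 24 }'
# -> 1000 24000
```

That is consistent with the openssl speed numbers above and far beyond our
200 SSL req/s, so threading is indeed not the limit here.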
> I suspect that what you're observing instead is just the result of your
> site being popular enough and browsers performing pre-connect: based on
> the browsing habits of their owner, they'll connect to the site in case
> they're expected to be about to visit it, so that the connection is
> already available. And you can accumulate lots of idle connections like
> this. They'll be killed by the http-request timeout if no request comes
> on them, but it can definitely count. We've seen some sites having to
> significantly increase their max number of connections when this started
> to appear a decade ago.
>
> In any case, if you have some RAM (and I suspect a 24-core machine does),
> you should leave large values on the front maxconn so that the kernel
> never refrains from accepting connections, because once the kernel's
> accept queue is full, the SYN packets are dropped and have to be
> retransmitted, which is much slower for the visitor. Very roughly
> speaking, an SSL connection during a handshake will take max 100kB of
> RAM (it's less but better round it up). This means that you're supposed
> to be able to support 10k per GB of RAM assigned to haproxy. This doesn't
> include request processing, forwarding to the server, possible handshakes
> on the other side, the use of large maps or stick-tables, nor the TCP
> stack's buffers. But if you keep in mind 10k conn / GB and you consider
> that haproxy will never use more than half of your machine's RAM, you see
> that you can be pretty safe with 40k conns on an 8GB machine, 4GB of
> which are assigned to haproxy.

That seems sensible. The vast majority of our sessions are very short
lived; we typically just deliver a response with a link to a CDN, so the
50th percentile response time from our backend servers is around 3-4ms. I
have set up HAProxy with an 8s connect timeout and 30s client and server
timeouts.
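That 10k-conn-per-GB rule of thumb works out like this for a box like
yours (the 4 GB assigned to haproxy is just the example split from above):

```shell
# ~100 kB of RAM per in-handshake SSL connection, so
# conns = RAM assigned to haproxy / 100 kB.
# Example: 4 GB assigned to haproxy on an 8 GB machine.
echo $(( 4 * 1000000000 / 100000 ))
# -> 40000
```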
Even at 1000 req/s it would be quite easy to fill up a 4096 maxconn (and
the default somaxconn/tcp_max_syn_backlog queues) with those timeouts if
the backends slowed down.

> You're totally right, and actually I'm glad you asked, because there are
> some power users on this list and maybe some would like to chime in and
> share some of their observations, suggestions, even disagree with my
> figures above, etc. Tuning is never simple, you need to know what you
> want to tune for, and in your case that's apparently resource usage and
> safety margin, so this involves a lot of parameters, which sometimes
> goes down to the choice of hardware vendor (CPU, NICs, etc).
>
> Willy

Absolutely, any tuning tips or up-to-date HAProxy tuning guides would be
really helpful for me, and perhaps for others reading; I certainly haven't
found any. For brevity I omitted a lot of details of our setup and
hardware (I should mention we are running HAProxy 2.9, though), and we may
just be a weird outlier use case that wouldn't be too useful to others.
But given that we are running on commodity servers with normal NICs and
looking to tune resource usage for http and https throughput, I hope there
is some more good information out there to be found.

Thanks again for your time,
Miles

