Hi Willy,

I am happy to follow up on the thread. Long story short: based on your suggestions we ran further experiments with the setup and, good news, things improved. Thank you. A short summary:

- CPU idle increased from 50% -> 80%
- system average load decreased from 8 -> 3
- software IRQ dropped from ~10% -> 0-1%

We think one node can now handle at least 60,000 HTTPS req/s. We have heard of higher numbers, but I think this is quite an achievement already!
A bit more detail: for the experiments we used three servers with identical hardware specs, and all tests were run against them at the same query rate, up to 45,000 HTTPS req/s.

First we removed the chaining of SSL to a single socket, as you suggested in 3). This showed no visible improvement on the software IRQs, but it reduced the average system load from 8 to 5 or less.

In the next stage we pinned ssl_termination, cleartext_http and lb_backend to dedicated processes, and split the IRQ affinity so that eth1 goes to CPUs 0 & 12 and eth0 to CPUs 8 & 20. This resulted in a noticeable improvement in all three areas mentioned above, and it is now our production configuration. What is fascinating about this configuration is that we applied more traffic and the system load still did not increase significantly: at 45,000 req/s the CPU idle was still 75%, with an average system load of 5.

In the third stage we upgraded one of the nodes to Debian 8 to evaluate whether SO_REUSEPORT from kernel 3.9+ makes any difference (we also adjusted the haproxy config as suggested in 2). We did not observe any visible improvement; perhaps we simply did not load the system enough. However, we noticed that the system load fluctuates less under spiky/bursty inbound traffic, which is probably a sign of the system becoming more stable. For now we decided not to go with SO_REUSEPORT.

We ran additional experiments, such as pinning the NIC IRQs to a single thread/physical CPU together with the backend, splitting the NIC queue affinities, etc., without any noticeable improvement. We are not sure whether there are still TCP tweaks left that we have not tried during the experiments.

And just in case, our final haproxy.cfg looks as follows:

global
    daemon
    log 127.0.0.1 local0
    maxconn 100000
    nbproc 24
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3
    cpu-map 5 4
    cpu-map 6 5
    cpu-map 7 6
    cpu-map 8 7
    cpu-map 9 8
    cpu-map 10 9
    cpu-map 11 10
    cpu-map 12 11
    cpu-map 13 12
    cpu-map 14 13
    cpu-map 15 14
    cpu-map 16 15
    cpu-map 17 16
    cpu-map 18 17
    cpu-map 19 18
    cpu-map 20 19
    cpu-map 21 20
    cpu-map 22 21
    cpu-map 23 22
    cpu-map 24 23
    tune.bufsize 16384
    spread-checks 4
    tune.maxrewrite 1024
    tune.maxpollevents 100
    tune.ssl.default-dh-param 2048
    pidfile /var/run/haproxy.pid
    stats socket 0.0.0.0:2001 process 1
    stats socket 0.0.0.0:2002 process 2
    stats socket 0.0.0.0:2003 process 3
    stats socket 0.0.0.0:2004 process 4
    stats socket 0.0.0.0:2005 process 5
    stats socket 0.0.0.0:2006 process 6
    stats socket 0.0.0.0:2007 process 7
    stats socket 0.0.0.0:2008 process 8
    stats socket 0.0.0.0:2009 process 9
    stats socket 0.0.0.0:2010 process 10
    stats socket 0.0.0.0:2011 process 11
    stats socket 0.0.0.0:2012 process 12
    stats socket 0.0.0.0:2013 process 13
    stats socket 0.0.0.0:2014 process 14
    stats socket 0.0.0.0:2015 process 15
    stats socket 0.0.0.0:2016 process 16
    stats socket 0.0.0.0:2017 process 17
    stats socket 0.0.0.0:2018 process 18
    stats socket 0.0.0.0:2019 process 19
    stats socket 0.0.0.0:2020 process 20
    stats socket 0.0.0.0:2021 process 21
    stats socket 0.0.0.0:2022 process 22
    stats socket 0.0.0.0:2023 process 23
    stats socket 0.0.0.0:2024 process 24

defaults
    mode http
    timeout connect 30s
    timeout client 60s
    timeout server 30s
    timeout queue 60s
    timeout http-request 30s
    timeout http-keep-alive 30s
    option redispatch
    option tcplog
    option dontlog-normal
    option http-keep-alive
    option splice-auto
    option http-no-delay
    log global

listen stats
    bind :4001 process 1
    bind :4002 process 2
    bind :4003 process 3
    bind :4004 process 4
    bind :4005 process 5
    bind :4006 process 6
    bind :4007 process 7
    bind :4008 process 8
    bind :4009 process 9
    bind :4010 process 10
    bind :4011 process 11
    bind :4012 process 12
    bind :4013 process 13
    bind :4014 process 14
    bind :4015 process 15
    bind :4016 process 16
    bind :4017 process 17
    bind :4018 process 18
    bind :4019 process 19
    bind :4020 process 20
    bind :4021 process 21
    bind :4022 process 22
    bind :4023 process 23
    bind :4024 process 24
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /
    stats auth someuser:somepass

listen ssl_termination
    bind :443 ssl crt /webapps/ssl/haproxy.new.crt ciphers AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-sslv3
    bind-process 4 5 6 7 8 10 11 12 16 17 18 20 22 23 24
    default_backend lb_backend

frontend cleartext_http
    bind :80
    default_backend lb_backend
    bind-process 2 3 14 15

backend lb_backend
    mode http
    fullconn 100000
    option httpchk HEAD /lbcheck.jsp HTTP/1.0
    option accept-invalid-http-response
    option forwardfor
    balance roundrobin
    server node111 172.22.18.75:80 check
    server node112 172.22.18.76:80 check
    server node113 172.22.18.77:80 check
    server node114 172.22.18.78:80 check
    server node115 172.22.18.79:80 check
    server node116 172.22.18.80:80 check
    server node117 172.22.18.81:80 check
    server node118 172.22.18.82:80 check
    server node119 172.22.18.83:80 check
    server node120 172.22.18.84:80 check
    server node121 172.22.18.93:80 check
    server node122 172.22.18.94:80 check
    bind-process 2 3 14 15
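For reference, the IRQ affinity split described above boils down to something like this sketch (the IRQ numbers 40 and 41 are hypothetical; the real ones come from /proc/interrupts on each box):

    # find the NIC IRQ numbers first
    grep -E 'eth0|eth1' /proc/interrupts

    # smp_affinity takes a hexadecimal CPU bitmap:
    # cpu0 + cpu12 -> 0x1001, cpu8 + cpu20 -> 0x100100
    echo 1001   > /proc/irq/40/smp_affinity   # eth1 -> cpu0 & cpu12
    echo 100100 > /proc/irq/41/smp_affinity   # eth0 -> cpu8 & cpu20

    # stop irqbalance so it does not rewrite the masks behind our back
    service irqbalance stop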
On Thursday, June 11, 2015 1:55 AM, Willy Tarreau <[email protected]> wrote:

Hi Eduard,

On Wed, Jun 10, 2015 at 04:56:31AM +0000, Eduard Rushanyan wrote:
> With a few folks here we have been learning and are already seeing quite
> good results with HAProxy. First of all I wanted to share that during the
> tests we achieved up to 45,000 requests per second on SSL on a single 1G
> box (with the same setup/hw as below). Isn't that amazing? :)

It's independent of the network connectivity. It also depends on whether you're doing it in keep-alive, with close and TLS resume, or with a new renegotiation on each request. Given the numbers, I'm assuming that you're in TLS resume mode, because the numbers would seem high for renegotiation (typically 500-1000 per core) and low for plain requests (typically 100000 per core).

> Also I wanted to ask for your opinion or advice on how we can possibly
> improve the setup further. It really feels like there is more headroom
> and we could tune the setup further.

I'm seeing room for improvement, as it's clear that you're not getting the most out of your machine. We usually observe around 10000 conn/s per core in TLS resume, so you're still far from that.

> Our use case is:
> - high requests per second (very high PPS/packets per second)
> - HTTPS
> - hundreds of thousands of requests per second
> - gigabytes of traffic per second
> - currently handled by hardware load balancers --> the aim is to replace
>   them with HAProxy
>
> What we currently have with HAProxy:
> - rate: 26,000 HTTPS requests per second, per single HAProxy server
> - CPU idle: 50%
> - system avg load: 8
> - software IRQs: ~10%
>
> What would be great to have:
> - reduced system load
> - more idle CPU
> - ability to push more bandwidth or more requests per second
> - no (or fewer) software IRQs, possibly fewer context switches/interrupts

You'll have to pick 1 from the last 3 :-)

> Do you think it's possible to further improve the current setup
> software/configuration-wise?

First, I'm seeing a number of things you can change in your config.

1) All the stats instances can be simplified into a single one carrying all the individual ports, making it much simpler to declare and the config easier to read:

listen stats
    bind :4001 process 1
    bind :4002 process 2
    ...
    bind :4024 process 24
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /
    stats auth someuser:somepass

2) You didn't specify any process binding in ssl_termination, so the kernel wakes up all processes on each incoming connection; a few of them take some and the other ones go back to sleep. With a kernel 3.9 or later, you can multiply the "bind" lines and bind each of them to a different process. The load will be much better distributed:

listen ssl_termination
    bind 0.0.0.0:443 process 1 ssl crt /webapps/ssl/haproxy.new.crt ciphers AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-sslv3
    bind 0.0.0.0:443 process 2 ssl crt /webapps/ssl/haproxy.new.crt ciphers AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-sslv3
    ...

3) You're chaining the SSL instances to the clear-text instance, thus doubling the internal connection rate. In general this ensures that a single process handles all the traffic, but in your case that's not true, since all 24 processes can randomly receive connections:

listen ssl_termination
    bind 0.0.0.0:443 ssl crt /webapps/ssl/haproxy.new.crt ciphers AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-sslv3
    server cleartext_http abns@haproxy-clear-listener send-proxy-v2

frontend cleartext_http
    bind 0.0.0.0:80
    bind abns@haproxy-clear-listener accept-proxy
    default_backend lb_backend

I'd suggest that you either avoid this bouncing or limit the number of processes listening to clear text. If you're fine with running the backend and the clear-text frontend on all processes, then you can simply put the "default_backend" rule in "ssl_termination" instead of the "server" line, as in the sketch below.
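For example, the non-chained variant could be as simple as this (an untested sketch reusing your existing crt/ciphers line):

listen ssl_termination
    bind 0.0.0.0:443 ssl crt /webapps/ssl/haproxy.new.crt ciphers AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-sslv3
    # no intermediate "server ... abns@..." hop: decrypted requests go
    # straight to lb_backend, halving the internal connection rate
    default_backend lb_backend

frontend cleartext_http
    bind 0.0.0.0:80
    default_backend lb_backend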
4) As you can see in /proc/interrupts, the ethernet interrupts are spread over all threads of the first CPU socket, so the haproxy processes running on the same cores are competing with the softirqs on the same threads, forcing the load to be unequal.

This last point is the trickiest to adjust and you'll have to experiment a lot. In general, what is important to know is:

- avoid inter-CPU communications as much as possible, as they come with latency and cache flushes;
- SSL processing is still faster on multiple CPU sockets than what you save by avoiding such communications.

Given that you're limited to 1 Gbps, the softirq load will remain very low, so in my opinion you should limit your IRQs to just a few cores. Note that using sibling threads for IRQs is interesting because it increases cache locality and still provides a nice boost (I've observed about a 20% performance increase by using 2 threads from the same core over just 1). Also, avoid mixing haproxy and softirqs on the same core or on threads of the same core, as it reduces the cache hit ratio. Since you're communicating in clear text with the servers, you absolutely must have the backend running on the same CPU socket as the network IRQs.

My suggestion would be to try something like this (and then adjust as you experiment):

- bind the eth0 and eth1 IRQs to each thread of the first two cores of the first CPU socket. That should be 0, 1, 12, 13 I guess. For your load, 4 threads to deal with both NICs' IRQs should be far more than enough.
- bind the clear-text frontend+backend to all threads of one or two cores of the first CPU socket. Let's try with 2, 3, 14, 15.
- bind the SSL frontend to all other threads.

I suspect that you can easily remove two of the threads dedicated to network interrupts, and that you can possibly do the same for the clear-text frontend, unless you manage to reach more than 50-100k conn/s. If you want to stay on 4 threads for interrupts, then you should be able to slightly improve the results by binding eth0 to the second CPU socket only and eth1 to the first one only. Indeed, eth1 will then be used exclusively for clear text while eth0 will be used exclusively by SSL, which can improve performance for half of the SSL threads. Thus I think you could end up with something like this:

- threads 0, 12: eth1 IRQs
- threads 1, 2, 13, 14: haproxy clear text
- threads 8, 20: eth0 IRQs
- all other threads: haproxy SSL (16 threads total)

I'd expect that you should reach roughly 80k conn/s with such a setup; see the config sketch below for one way to express it.
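With your 1:1 cpu-map (process N pinned to thread N-1), that layout would translate into something like this sketch (only the relevant lines; the process lists are illustrative, adjust them to your numbering):

# threads 0 & 12 and 8 & 20 are left free for the eth1/eth0 IRQs

frontend cleartext_http
    # clear text on threads 1, 2, 13, 14
    bind-process 2 3 14 15

listen ssl_termination
    # SSL on the 16 remaining threads
    bind-process 4 5 6 7 8 10 11 12 16 17 18 19 20 22 23 24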
> OpenSSL:
> ./config --prefix=$LIBSSLBUILD no-shared no-ssl2 no-ssl3 -DOPENSSL_USE_IPV6=0
> no-err enable-ec_nistp_64_gcc_128 zlib

You should double-check whether the line above enables the arch-specific ASM optimizations or not (look for .o files in some asm/ subdirs). I seem to remember that it was required to explicitly specify x86_64 somewhere, but I could be wrong.
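If it doesn't, an explicit target usually does the trick; something along these lines (from memory, for an OpenSSL 1.0.x source tree):

    ./Configure linux-x86_64 --prefix=$LIBSSLBUILD no-shared no-ssl2 no-ssl3 \
        -DOPENSSL_USE_IPV6=0 no-err enable-ec_nistp_64_gcc_128 zlib
    make
    # the asm-derived objects should then show up, e.g.:
    find crypto -name '*x86_64*.o' | head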
> HAProxy:
> make TARGET=linux2628 CPU=native USE_PCRE=1 USE_OPENSSL=1 USE_ZLIB=1
> USE_TFO=1 ADDINC=-I$LIBSSLBUILD/include ADDLIB="-L$LIBSSLBUILD/lib -ldl"

Fine.

Willy
