Hi Eduard,
On Wed, Jun 10, 2015 at 04:56:31AM +0000, Eduard Rushanyan wrote:
> With a few folks here we did some learning and are already seeing quite
> good results with HAProxy. First of all, I wanted to share that during the
> tests we achieved up to 45,000 requests per second on SSL on a single 1G
> box (with the same setup/hw below). Isn't that amazing? :)
It depends on the network connectivity. It also depends whether
you're doing it in keep-alive, with close and TLS resume, or with a
new renegotiation on each request. Given the numbers, I'm assuming
that you're in TLS resume mode, because the numbers would seem high
for renegotiation (typically 500-1000 per core) and low for requests
(typically 100000 per core).
> Also wanted to ask for your opinion or advice on how we can possibly improve
> the setup further. It really feels like there is more headroom and we could
> tune the setup further.
I'm seeing room for improvement, as it's clear that you're not getting
the most out of your machine. We usually observe around 10000 conn/s
per core in TLS resume, so you're still far from this.
> Our use case is:
> - high requests-per-second traffic (very high PPS / packets per second)
> - HTTPS
> - hundreds of thousands of requests per second
> - gigabytes of traffic per second
> - currently handled by hardware LoadBalancers --> aim to replace hardware
> LoadBalancers with HAProxy
>
> What do we have currently in HAProxy:
> Rate: 26,000 HTTPS requests per second, per single HAProxy server
> CPU idle: 50%
> System avg load: 8
> Software IRQs %: ~10%
>
> What would be great to have:
> - reduced system load
> - more idle CPU
> - ability to push more bandwidth or more requests per second
> - no Software IRQs (or less), possibly less context switches/interrupts
You'll have to pick 1 from the last 3 :-)
> Do you think it's possible to further improve current setup
> software/configuration wise?
First I'm seeing a number of things you can change in your config.
1) all the stats instances can be simplified to a single one with
all the individual ports, making it much simpler to declare and
the config easier to read :
listen stats
bind :4001 process 1
bind :4002 process 2
...
bind :4024 process 24
mode http
stats enable
stats hide-version
stats realm Haproxy\ Statistics
stats uri /
stats auth someuser:somepass
2) you didn't specify any process binding in ssl_termination, so the
kernel wakes up all processes on each incoming connection; a few of
them accept some and the others go back to sleep. With a kernel
3.9 or later, you can multiply the "bind" lines and bind each of them
to a different process. The load will be much better distributed :
listen ssl_termination
bind 0.0.0.0:443 process 1 ssl crt /webapps/ssl/haproxy.new.crt ciphers
AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-ssl3
bind 0.0.0.0:443 process 2 ssl crt /webapps/ssl/haproxy.new.crt ciphers
AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-ssl3
...
3) you're chaining the SSL instances to the clear-text instance,
thus doubling the internal connection rate. In general this ensures
that you have a single process which handles all the traffic, but in
your case that's not true since all 24 processes can randomly receive
the connection :
listen ssl_termination
bind 0.0.0.0:443 ssl crt /webapps/ssl/haproxy.new.crt ciphers
AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-ssl3
server cleartext_http abns@haproxy-clear-listener send-proxy-v2
frontend cleartext_http
bind 0.0.0.0:80
bind abns@haproxy-clear-listener accept-proxy
default_backend lb_backend
I'd suggest that you either avoid this bouncing or limit the number
of processes listening in clear text. If you're fine with running
the backend and the clear-text frontend on all processes, then you can
simply put a "default_backend" rule in "ssl_termination" instead
of the "server" line.
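For example, a minimal sketch of that variant (assuming "lb_backend"
is declared so that it runs on all processes) :

    listen ssl_termination
        bind 0.0.0.0:443 ssl crt /webapps/ssl/haproxy.new.crt ciphers ...
        default_backend lb_backend

    frontend cleartext_http
        bind 0.0.0.0:80
        default_backend lb_backend

That removes the abns@ hop entirely, so each SSL request is handled
over a single internal connection instead of two.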
4) as you can see in /proc/interrupts, the ethernet interrupts are
spread over all threads of the first CPU socket, so the haproxy
processes running on the same cores are competing with the softirq
on the same threads, forcing the load to be unequal.
The last point is the trickiest to adjust and you'll have to experiment
a lot. In general, what is important to know :
- avoid inter-CPU communications as much as possible, as they come
with latency and cache flushes ;
- SSL processing still gains more from being spread over multiple CPU
sockets than what you save by avoiding such communications.
Given that you're limited to 1 Gbps, the softirq load will remain very
low so in my opinion you should limit your IRQs to just a few cores.
Note that using threads for IRQs is interesting because it increases
cache locality and still provides a nice boost (I've observed about
20% perf increase by using 2 threads from the same core over just 1).
Also, avoid mixing haproxy and softirq on the same core or threads of
the same core. It reduces cache hit ratio.
Since you're communicating in clear text to the servers, you must
absolutely have the backend running on the same CPU socket as the
network IRQs.
My suggestion would be to try something like this (and then you can
adjust to experiment) :
- bind eth0 and eth1 IRQs to each thread of the first two cores of
the first CPU socket. That should be 0, 1, 12, 13 I guess. For
your load, 4 threads to deal with both NICs' IRQs should be far
more than enough.
- bind the clear-text frontend+backend to all threads of one or
two cores of the first CPU socket. Let's try with 2, 3, 14, 15.
- bind the SSL frontend to all other threads.
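The IRQ part of this can be scripted. Here is a small sketch; the
thread IDs (0 1 12 13) come from the layout above and the IRQ numbers
are placeholders, so check /proc/interrupts on your box for the real
ones :

```shell
# Sketch only: build an smp_affinity hex mask for a set of CPU
# threads, then (on the real machine) write it to each NIC IRQ.

mask_for() {
    # OR together one bit per thread ID, print the result as hex
    local m=0 cpu
    for cpu in "$@"; do
        m=$(( m | (1 << cpu) ))
    done
    printf '%x\n' "$m"
}

NIC_MASK=$(mask_for 0 1 12 13)
echo "NIC irq mask: $NIC_MASK"      # threads 0,1,12,13 -> 3003

# on the target machine (IRQ numbers below are placeholders):
# for irq in 41 42 43 44; do
#     echo "$NIC_MASK" > /proc/irq/$irq/smp_affinity
# done
```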
I suspect that you can easily remove two threads for network interrupts,
and that you can possibly do the same for the clear-text frontend,
except if you manage to reach more than 50-100k conn/s.
If you want to stay on 4 threads for interrupts, then you should be
able to slightly improve the results by binding eth0 to the second
CPU socket only and eth1 to the first one only. Indeed, eth1 will
exclusively be used for clear text while eth0 will exclusively be
used by SSL. So that can improve performance for half of the SSL
threads.
Thus I think you could end up with something like this :
- threads 0, 12 : eth1 irqs
- threads 1, 2, 13, 14 : haproxy clear
- threads 8, 20 : eth0 irqs
- all other threads : haproxy SSL (16 threads total)
I'd expect that you should reach roughly 80k conn/s with such a setup.
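If you want to pin the haproxy processes themselves to these threads,
the layout above can be expressed with "cpu-map" in the global section
(available since 1.5). This is only a sketch; the process numbering is
arbitrary and must match the "process" keywords on your "bind" lines :

    global
        nbproc 20
        # processes 1-4: clear text, pinned to threads 1, 2, 13, 14
        cpu-map 1 1
        cpu-map 2 2
        cpu-map 3 13
        cpu-map 4 14
        # processes 5-20: SSL, pinned to the remaining threads
        # (0, 12 and 8, 20 are left to the NIC IRQs)
        cpu-map 5 3
        cpu-map 6 4
        ...
        cpu-map 20 23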
> OpenSSL:
> ./config --prefix=$LIBSSLBUILD no-shared no-ssl2 no-ssl3 -DOPENSSL_USE_IPV6=0
> no-err enable-ec_nistp_64_gcc_128 zlib
You should double-check if the line above enables arch-specific ASM
optimizations or not (look for .o files in some asm/ subdirs). I seem
to remember that it was required to explicitly specify x86_64 somewhere,
but I could be wrong.
> HAProxy:
> make TARGET=linux2628 CPU=native USE_PCRE=1 USE_OPENSSL=1 USE_ZLIB=1
> USE_TFO=1 ADDINC=-I$LIBSSLBUILD/include ADDLIB="-L$LIBSSLBUILD/lib -ldl"
fine.
Willy