On Sat, May 27, 2023 at 02:56:39PM -0600, Shawn Heisey wrote:
> On 5/27/23 02:59, Willy Tarreau wrote:
> > The little difference makes me think you've sent your requests over
> > a keep-alive connection, which is fine, but which doesn't stress the
> > TLS stack anymore.
> 
> Yup.  It was using keepalive.  I turned keepalive off and repeated the
> tests.
> 
> I'm still not seeing a notable difference between the branches, so I have to
> wonder whether I need a completely different test.  Or whether I simply
> don't need to worry about it at all because my traffic needs are so small.

Have you verified that the CPU is saturated?

> Requests per second is down around 60 instead of 1200, and the request time
> percentile values went up.

At such a low performance level it's unlikely that you're hurting the CPU
at all; I suspect the limiting factor is the load generator (or there's
something else going on).

> I've included two runs per branch here.  24
> threads, each doing 1000 requests.  The haproxy logs indicate the page I'm
> hitting returns 829 bytes, while the actual index.html is 1187 bytes.  I
> think gzip compression and the HTTP headers explain the difference.
> Without keepalive, the overall test takes a lot longer, which is not
> surprising.

Without keep-alive or TLS resume, you should see roughly 1000 connections
per second per core, and with TLS resume you should see roughly 4000 conns/s
per core. So if you have 12 cores, you should see about 12000 conns/s with
full rekey, or 48000 conns/s with TLS resume.

Hmmm, are you sure you didn't build the client with OpenSSL 3.0? I'm asking
because that was our first concern when we tested the perf on Intel's SPR
machine. There was no way to go beyond 400 conn/s, with haproxy totally idle
and the client at 100% on 48 cores... The cause was OpenSSL 3. Rebuilding it
against 1.1.1 brought it to 74000 conn/s, almost 200 times more!
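
In case it helps, a quick way to check (assuming the load generator is a
native binary dynamically linked against libssl; the path below is just a
placeholder):

  # show which libssl/libcrypto the client binary would load at run time
  ldd /path/to/your/client | grep -Ei 'ssl|crypto'
  # show the system's default OpenSSL version
  openssl version

If it reports libssl.so.3, that alone could explain poor client-side numbers.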

> The high percentiles are not encouraging.  7 seconds to get a web page under
> 1kb, even with 1.1.1t?
> 
> This might be interesting to someone:
> 
> https://asciinema.elyograg.org/haproxyssltest1.html

Hmmm host not found here.

> I put the project in github.
> 
> https://github.com/elyograg/haproxytestssl

I'm seeing everything being done in doGet(), but I have no idea of the
overhead of the allocations there, nor of the cost of the lower layers.
Maybe there's even some DNS resolution involved, I don't know. That's
exactly what I don't like about such languages: they come with tons of
pre-defined functions to do whatever you want, but you have no idea how
they do it, so in the end you don't know what you're testing.

Please do me a favor and verify two things:
  - check the CPU usage using "top" on the haproxy machine during the
    test
  - check the CPU usage using "top" on the load geneator machine during
    the test
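
For example (a rough sketch; any equivalent tool works, and mpstat needs
the sysstat package):

  # run on each machine while the test is in progress
  top -b -d 1 -n 5 | grep 'Cpu(s)'   # overall CPU usage, 5 one-second snapshots
  mpstat -P ALL 1                    # per-core utilization, 1-second samples

What matters is which side, if any, reaches 100% before the numbers plateau.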

Until you reach 100% on haproxy, you're measuring something else. Please
do a comparative check using h1load from a machine that has OpenSSL 1.1.1
(e.g. Ubuntu 20):

  git clone https://github.com/wtarreau/h1load/
  cd h1load
  make -j
  ./h1load -t $(nproc) -c 240 -r 1 --tls-reuse https://hostname/path

This will create 240 concurrent connections to the server, without
keep-alive (-r 1 = 1 request per connection), with TLS session
resume, and using as many threads as you have CPU cores. You'll
see the number of connections per second in the cps column, and the
number of requests per second in the rps column. In the leftmost column
you'll see the instantaneous number of connections, and on the right the
response time in milliseconds. And please do check that this
time the CPU is saturated either on haproxy or on the client. If you
have some network latency between the two, you may need to increase
the number of connections. You can drop "-r 1" if you want to test
with keep-alive. Or you can drop --tls-reuse if you want to test the
rekeying performance (for sites that take many new clients making
few requests). You can also limit the total number of requests, using
"-n 24000" for example. Better make sure this number is an integral
multiple of the number of connections; it's not mandatory, but it's
cleaner. Similarly, it's better if the number of connections (-c) is an
integral multiple of the number of threads (-t) so that each thread is
equally loaded.
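
For instance, to measure full rekeying with a bounded run (just an
illustration of the options above; adjust the host and the counts to your
setup):

  # 24000 total requests over 240 connections (an exact multiple), one
  # request per connection, no TLS resume, one thread per CPU core
  ./h1load -t $(nproc) -c 240 -r 1 -n 24000 https://hostname/path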

Willy
