On Sat, May 27, 2023 at 02:56:39PM -0600, Shawn Heisey wrote:
> On 5/27/23 02:59, Willy Tarreau wrote:
> > The little difference makes me think you've sent your requests over
> > a keep-alive connection, which is fine, but which doesn't stress the
> > TLS stack anymore.
>
> Yup. It was using keepalive. I turned keepalive off and repeated the
> tests.
>
> I'm still not seeing a notable difference between the branches, so I
> have to wonder whether I need a completely different test. Or whether
> I simply don't need to worry about it at all because my traffic needs
> are so small.
Have you verified that the CPU is saturated?

> Requests per second is down around 60 instead of 1200, and the request
> time percentile values went up.

At such a low performance it's unlikely that you could hurt the CPU at
all; I suspect the limiting factor is the load generator (or there's
something else).

> I've included two runs per branch here. 24 threads, each doing 1000
> requests. The haproxy logs indicate the page I'm hitting returns 829
> bytes, while the actual index.html is 1187 bytes. I think gzip
> compression and the HTTP headers explain the difference. Without
> keepalive, the overall test takes a lot longer, which is not
> surprising.

Without keep-alive nor TLS resume, you should see roughly 1000
connections per second per core, and with TLS resume you should see
roughly 4000 conns/s per core. So if you have 12 cores you should see
48000 or 12000 conns/s depending on whether you're using TLS resume or
a full rekey.

Hmmm, are you sure you didn't build the client with OpenSSL 3.0? I'm
asking because that was our first concern when we tested the performance
on Intel's SPR machine: no way to go beyond 400 conn/s, with haproxy
totally idle and the client at 100% on 48 cores... The cause was
OpenSSL 3. Rebuilding under 1.1.1 jumped to 74000, almost 200 times
more!

> The high percentiles are not encouraging. 7 seconds to get a web page
> under 1kb, even with 1.1.1t?
>
> This might be interesting to someone:
>
> https://asciinema.elyograg.org/haproxyssltest1.html

Hmmm, host not found here.

> I put the project in github.
>
> https://github.com/elyograg/haproxytestssl

I'm seeing everything being done in doGet() but I have no idea about
the overhead of the allocations there nor the cost of the lower layers.
Maybe there's even some DNS resolution involved, I don't know. That's
exactly what I don't like with such languages: they come with tons of
pre-defined functions to do whatever you need, but you have no idea how
they do it, so in the end you don't know what you're testing.
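To make the per-core arithmetic above concrete, here is a quick
back-of-the-envelope sketch (the per-core rates are the rough ballpark
figures quoted above, not measurements from your setup):

```shell
#!/bin/sh
# Rough estimate of expected TLS connection rates, using the ballpark
# per-core figures above: ~1000 conns/s/core for a full handshake,
# ~4000 conns/s/core with TLS session resume.
cores=12
full_rekey=$((cores * 1000))
tls_resume=$((cores * 4000))
echo "full rekey: ${full_rekey} conns/s"
echo "TLS resume: ${tls_resume} conns/s"
```

If your measured rate is orders of magnitude below these estimates, the
bottleneck is almost certainly not the TLS stack on the server.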
Please do me a favor and verify two things:

  - check the CPU usage using "top" on the haproxy machine during the
    test
  - check the CPU usage using "top" on the load generator machine
    during the test

Until you reach 100% on haproxy, you're measuring something else.

Please do a comparative check using h1load from a machine having
OpenSSL 1.1.1 (e.g. Ubuntu 20):

  git clone https://github.com/wtarreau/h1load/
  cd h1load
  make -j
  ./h1load -t $(nproc) -c 240 -r 1 --tls-reuse https://hostname/path

This will create 240 concurrent connections to the server, without
keep-alive (-r 1 = 1 request per connection), with TLS session resume,
and using as many threads as you have CPU cores. You'll see the number
of connections per second in the "cps" column, and the number of
requests per second in the "rps" column. In the left column you'll see
the instant number of connections, and on the right you'll see the
response time in milliseconds.

And please do check that this time the CPU is saturated either on
haproxy or on the client. If you have some network latency between the
two, you may need to increase the number of connections.

You can drop "-r 1" if you want to test with keep-alive, or you can
drop "--tls-reuse" if you want to test the rekeying performance (for
sites that take many new clients making few requests). You can also
limit the total number of requests using "-n 24000" for example. Better
make sure this number is an integral multiple of the number of
connections; even though this is not mandatory, at least it's cleaner.
Similarly, it's better if the number of connections (-c) is an integral
multiple of the number of threads (-t) so that each thread is equally
loaded.

Willy
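P.S.: the advice on integral multiples can be sketched in shell; the
multipliers 24 and 100 are arbitrary illustrative choices, and the
host/path is a placeholder:

```shell
#!/bin/sh
# Derive h1load arguments so that -c is an integral multiple of -t and
# -n is an integral multiple of -c, keeping every thread equally loaded.
threads=$(nproc)            # one h1load thread per CPU core
conns=$((threads * 24))     # -c: integral multiple of -t (24 is arbitrary)
reqs=$((conns * 100))       # -n: integral multiple of -c (100 is arbitrary)
echo "./h1load -t $threads -c $conns -r 1 -n $reqs --tls-reuse https://hostname/path"
```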