> Slow enough for what? I agree that if we run the tests over an actual
> physical network, a system with a fast enough CPU will saturate the
> network and then it mostly won't matter how fast or not curl is.
>
> For this kind of test to be sensible, we need to make sure to either
> have a faster pipe than can be saturated by a single CPU core or do a
> test setup that can't do it due to complexity. It seems easiest to
> accomplish this by doing transfers on localhost.

Exactly, and much better worded than what I wrote!

> When doing transfers on localhost I don't think it matters much exactly
> how fast the CPU is, and I'm convinced we will see deltas between
> versions whichever CPU we use. In fact, I believe I've already spotted
> some. I'm just not ready yet to draw the conclusions nor to start
> working on figuring out why they exist.

Doing full transfers makes sense if you want to test several parts of the
library code. It didn't make sense in my case: I know the TLS handshake
is (terribly) slow and that it is needed for every transfer (the server
closes the connection even with Connection: keep-alive!), so I was
testing transfer speed only.

> I've provided scripts in the curl/relative directory now that can:
>
> 1. build 'sprinter' the test tool
> 2. build (lib)curl for a number of versions and install them locally
> 3. run sprinter with each of those built versions
>
> It seems most interesting to do A LOT of smaller transfers with a
> fairly huge concurrency. I've played with doing 100,000 4K transfers at
> 100 at a time and with 6 8GB transfers at 2 at a time, and the latter
> will mostly just saturate the memory bandwidth in the machine.
>
> The output for the sprinter runs is not easily "comparable" yet and
> there's no machine help to detect regressions etc but it's a decent
> start I think.
>
> I'm curious if others will see the same thing I seem to see right now...

You probably won't fall into the same *pitfall* I fell into, because even
if you have 100 transfers in parallel in a "multi", you are still "single
threaded".
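For other readers of the list, this is the pattern I mean; a minimal
sketch only, not your actual sprinter code (which I have not read): the
localhost URL, the 4k.bin file name and the 100-way concurrency are just
illustrative, and curl_multi_poll() needs libcurl 7.66.0 or later:

#include <curl/curl.h>

/* swallow the payload; for timing we only care about the transfer */
static size_t discard(char *ptr, size_t size, size_t nmemb, void *userdata)
{
  (void)ptr; (void)userdata;
  return size * nmemb;
}

int main(void)
{
  CURLM *multi;
  int i, running;

  curl_global_init(CURL_GLOBAL_DEFAULT);
  multi = curl_multi_init();

  /* 100 concurrent transfers, all driven by this one thread */
  for(i = 0; i < 100; i++) {
    CURL *easy = curl_easy_init();
    curl_easy_setopt(easy, CURLOPT_URL, "http://localhost/4k.bin");
    curl_easy_setopt(easy, CURLOPT_WRITEFUNCTION, discard);
    curl_multi_add_handle(multi, easy);
  }

  /* a single event loop: the scheduler sees one busy thread, which is
     exactly the kind of load that bumps one core to full frequency */
  do {
    curl_multi_perform(multi, &running);
    if(running)
      curl_multi_poll(multi, NULL, 0, 1000, NULL);
  } while(running);

  /* removal and cleanup of the easy handles is omitted for brevity */
  curl_multi_cleanup(multi);
  curl_global_cleanup();
  return 0;
}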
What explained why, in my case, I got times varying from 7:00 to 11:30
for the exact same transfer is a side effect of multi-threading. It
happens that my test laptop runs with the conservative governor. By
default, fuse multi-threads its requests, and they are then served with
the same curl handle, which must jump from one core to another. Not only
is that bad because of cache bouncing, but with the conservative governor
it didn't generate enough load for the cores to be stepped up in
frequency. Hence the whole transfer was done at the lowest frequency!

Running fuse "single threaded" (the -s option), or my old algorithm that
uses a "worker stream", works better because the load on the worker (or
single thread) is enough for the governor to push the frequency to the
maximum allowed. But the measurements only became really steady when I
set all the cores to a single frequency.

For the same reason, on my more recent desktop, where the minimum
frequency is enough to cope with the workload, the measurements were
super steady, and none of the four very different algorithms made any
difference (looking at wall time only).

Possibly, too, one of the reasons you find it better to have 100
transfers in parallel is that it generates some load, so the core is
bumped to its maximum frequency very soon and stays there.

Better safe than sorry: to be sure you eliminate this bias, I suggest you
lock your machine at a given frequency before running the test script. If
you don't, and kernel+ACPI decides to change the frequency in the middle
of a test, the standard deviation of your measurements may increase to
the point where they become quite hard to make any sense of!

For instance, 1600MHz being the max on my laptop (adapt to your machine,
its number of cores, and the governors it allows); any governor is OK to
start from, since you then set a unique frequency:

  for core in $( seq 0 7 ); do
      sudo cpufreq-set -c $core -f 1600MHz
      sudo cpufreq-set -c $core -g userspace
  done

Check with:

  for core in $( seq 0 7 ); do sudo cpufreq-info -c $core -p; done

(There is no such locking in your script at the moment, and it is hard to
code generically since it depends on the processor... and you probably
don't want to run a curl test as sudo, or even have sudo calls in your
script!)

I also run my tests under 'perf'; it gives interesting figures about
cache references and cache misses...

(Off topic:) the famous effect of "memcpy" on which we disagreed is very
real on a "slow" machine, believe me, and it shows in the wall time! It's
no mystery that the kernel guys tried to eliminate it (especially with
fuse) with the "splicing" technique. Copying the standard 16k curl buffer
not only takes instruction time, it completely flushes the L1 data cache
on my laptop's processor ('only' 32k of data cache per core). 'perf'
shows it.

This probably accounts for most of the difference between using "curl
with pause" and the raw curl_easy_recv(), since with the latter the copy
is not necessary: curl_easy_recv() receives the address where the data
should be placed.
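To make that concrete, here is a rough sketch of the connect-only path,
where the application hands its own buffer to libcurl; the host, the
request string and the buffer size are illustrative, and a real program
would poll() the socket between CURLE_AGAIN returns instead of spinning
as this toy loop does:

#include <curl/curl.h>
#include <string.h>

int main(void)
{
  CURL *curl;
  const char req[] = "GET /4k.bin HTTP/1.1\r\nHost: localhost\r\n\r\n";
  char buf[16384];   /* the caller's buffer: data lands here directly */
  size_t n;
  CURLcode rc;

  curl_global_init(CURL_GLOBAL_DEFAULT);
  curl = curl_easy_init();
  curl_easy_setopt(curl, CURLOPT_URL, "http://localhost/");
  curl_easy_setopt(curl, CURLOPT_CONNECT_ONLY, 1L);
  if(curl_easy_perform(curl) == CURLE_OK) {  /* connects, nothing more */
    curl_easy_send(curl, req, strlen(req), &n);
    do {
      /* no write callback, no extra copy out of an internal buffer:
         curl_easy_recv() is told where to put the data; the loop ends
         when the server closes the connection (CURLE_OK with n == 0) */
      rc = curl_easy_recv(curl, buf, sizeof(buf), &n);
    } while(rc == CURLE_AGAIN || (rc == CURLE_OK && n > 0));
  }
  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return 0;
}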
Nevertheless, I agree with you: it is hard to interpret all these
figures, and mathematically impossible to optimise for every different
case, especially for a library. For instance, I saw you did a great job
optimising the size of the main curl structure over time. For some of my
algorithms, further 'nano-optimisation' could be obtained by clever
structure placement, to minimise the number of cache lines the core has
to refresh. But another person, using another protocol, would need a
different placement...

Rest assured, as I said in the summary of my previous e-mail, on a
"recent enough desktop" (at 5.5 years old, my desktop is "recent
enough"), none of this makes any difference. But just like you, I am
trying to find the "right algorithm", since it does make a difference on
my antique laptop, and also on the very recent Raspberry Pi 4!

Cheers, and keep up the good job.

Alain