Hi,

On 2025-02-14 09:32:32 +0100, Jakub Wartak wrote:
> On Wed, Feb 12, 2025 at 1:03 AM Andres Freund <and...@anarazel.de> wrote:
> > FWIW, I see substantial performance *regressions* with *big* IO sizes using
> > fio. Just looking at cached buffered IO.
> >
> > for s in 4 8 16 32 64 128 256 512 1024 2048 4096 8192;do echo -ne "$s\t\t";
> > numactl --physcpubind 3 fio --directory /srv/dev/fio/ --size=32GiB
> > --overwrite 1 --time_based=0 --runtime=10 --name test --rw read --buffered
> > 0 --ioengine psync --buffered 1 --invalidate 0 --output-format json
> > --bs=$((1024*${s})) |jq '.jobs[] | .read.bw_mean';done
> >
> > io size kB      throughput in MB/s
> [..]
> > 256             16864
> > 512             19114
> > 1024            12874
> [..]
>
> > It's worth noting that if I boot with mitigations=off clearcpuid=smap I get
> > *vastly* better performance:
> >
> > io size kB      throughput in MB/s
> [..]
> > 128             23133
> > 256             23317
> > 512             25829
> > 1024            15912
> [..]
>
> > Most of the gain isn't due to mitigations=off but clearcpuid=smap. Apparently
> > SMAP, which requires explicit code to allow kernel space to access userspace
> > memory, to make exploitation harder, reacts badly to copying lots of memory.
> >
> > This seems absolutely bonkers to me.
>
> There are two bizarre things there, +35% perf boost just like that due
> to security drama, and that io_size=512kb being so special to give a
> 10-13% boost in Your case? Any ideas, why?
I think there are a few overlapping "cost factors", and 512kB happens to be
where their combined cost reaches its minimum:

- syscall overhead: the fewer syscalls the better
- memory copy cost: higher for small-ish amounts, then lower
- SMAP costs: seem to increase with larger amounts of memory
- CPU cache: copying less than the L3 cache size will be faster, as otherwise
  memory bandwidth plays a role


> I've got on that Lsv2
> individual MS nvme under Hyper-V, on ext4, which seems to be much more
> real world and average Joe situation, and it is much slower, but it is
> not showing advantage for blocksize beyond let's say 128:
>
> io size kB      throughput in MB/s
> 4               1070
> 8               1117
> 16              1231
> 32              1264
> 64              1249
> 128             1313
> 256             1323
> 512             1257
> 1024            1216
> 2048            1271
> 4096            1304
> 8192            1214
>
> top hitter on of course stuff like clear_page_rep [k] and
> rep_movs_alternative [k] (that was with mitigations=on).

I think you're measuring something different than I was. I was purposefully
measuring a fully cached workload, which worked with that recipe because I
have more than 32GB of RAM available. But I assume you're running this in a
VM that doesn't have that much, and thus you're actually benchmarking reading
data from disk and - probably more influential in this case - finding buffers
to put the newly read data in.
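FWIW, a rough way to double check which of the two you're measuring is to look
at how much of the test file is resident in the page cache before the timed
run. A minimal sketch, assuming util-linux's fincore is available and that fio
named the data file test.0.0 (its default for a job called "test"; adjust the
path to whatever your run actually created):

  # warm the cache with one untimed sequential pass over the test file
  dd if=/srv/dev/fio/test.0.0 of=/dev/null bs=8M status=none

  # report how much of the file is resident in the page cache; for a truly
  # cached run RES should be (close to) the full file size
  fincore /srv/dev/fio/test.0.0

  # sanity check: total RAM needs to comfortably exceed the 32GiB file,
  # otherwise pages get evicted again between passes
  free -g

If RES stays well below the file size, the throughput numbers are dominated by
disk reads and page reclaim rather than by the copy-out path the SMAP
discussion above is about.

Greetings,

Andres Freund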