Hi,

On 2025-02-14 09:32:32 +0100, Jakub Wartak wrote:
> On Wed, Feb 12, 2025 at 1:03 AM Andres Freund <and...@anarazel.de> wrote:
> > FWIW, I see substantial performance *regressions* with *big* IO sizes using
> > fio. Just looking at cached buffered IO.
> >
> > for s in 4 8 16 32 64 128 256 512 1024 2048 4096 8192;do echo -ne "$s\t\t"; 
> > numactl --physcpubind 3 fio --directory /srv/dev/fio/ --size=32GiB 
> > --overwrite 1 --time_based=0 --runtime=10 --name test --rw read --buffered 
> > 0 --ioengine psync --buffered 1 --invalidate 0 --output-format json 
> > --bs=$((1024*${s})) |jq '.jobs[] | .read.bw_mean';done
> >
> > io size kB      throughput in MB/s
> [..]
> > 256             16864
> > 512             19114
> > 1024            12874
> [..]
> 
> > It's worth noting that if I boot with mitigations=off clearcpuid=smap I get
> > *vastly* better performance:
> >
> > io size kB      throughput in MB/s
> [..]
> > 128             23133
> > 256             23317
> > 512             25829
> > 1024            15912
> [..]
> > Most of the gain isn't due to mitigations=off but clearcpuid=smap. 
> > Apparently
> > SMAP, which requires explicit code to allow kernel space to access userspace
> > memory, to make exploitation harder, reacts badly to copying lots of memory.
> >
> > This seems absolutely bonkers to me.
> 
> There are two bizarre things there, +35% perf boost just like that due
> to security drama, and that io_size=512kb being so special to give a
> 10-13% boost in Your case? Any ideas, why?

I think there are a few overlapping "cost factors", and 512kB just happens to
be where their sum bottoms out (a rough standalone way to probe this is
sketched after the list):
- syscall overhead: the fewer syscalls, the better
- memory copy cost: higher per byte for small-ish amounts, then lower
- smap costs: seem to increase with larger amounts of memory
- CPU cache: copying less than the L3 cache size is faster, as otherwise memory
  bandwidth starts to play a role
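
If you want to poke at this without fio, something like the sketch below reads
an already-cached file with pread() at a few block sizes and prints MB/s. It's
just a rough illustration, not the setup I benchmarked; the path, block sizes
and byte counts are placeholders you'd have to adjust to a file that fits fully
in your page cache.

    /*
     * Rough sketch: pread() an already-cached file at various block sizes
     * and print throughput, to see where syscall overhead, per-byte copy
     * cost, SMAP overhead and L3 size balance out.
     * FILE_PATH and TOTAL_BYTES are placeholders.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define FILE_PATH   "/srv/dev/fio/test.0.0"      /* placeholder */
    #define TOTAL_BYTES (4LL * 1024 * 1024 * 1024)   /* read 4 GiB per size */

    static double now_sec(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        size_t  sizes_kb[] = {4, 64, 256, 512, 1024, 4096};
        int     fd = open(FILE_PATH, O_RDONLY);
        off_t   fsize;

        if (fd < 0) { perror("open"); return 1; }
        fsize = lseek(fd, 0, SEEK_END);

        for (size_t i = 0; i < sizeof(sizes_kb) / sizeof(sizes_kb[0]); i++)
        {
            size_t      bs = sizes_kb[i] * 1024;
            char       *buf = malloc(bs);
            double      start = now_sec();
            long long   total = 0;
            off_t       off = 0;

            if (buf == NULL) { perror("malloc"); return 1; }

            /* read TOTAL_BYTES from the (cached) file, wrapping at EOF */
            while (total < TOTAL_BYTES)
            {
                ssize_t n = pread(fd, buf, bs, off);

                if (n <= 0) { perror("pread"); return 1; }
                total += n;
                off = (off + n < fsize) ? off + n : 0;
            }
            printf("%7zu kB: %.0f MB/s\n",
                   sizes_kb[i], total / 1e6 / (now_sec() - start));
            free(buf);
        }
        close(fd);
        return 0;
    }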



> I've got on that Lsv2
> individual MS nvme under Hyper-V, on ext4, which seems to be much more
> real world and average Joe situation, and it is much slower, but it is
> not showing advantage for blocksize beyond let's say 128:
> 
> io size kB      throughput in MB/s
> 4        1070
> 8        1117
> 16        1231
> 32        1264
> 64        1249
> 128        1313
> 256        1323
> 512        1257
> 1024    1216
> 2048    1271
> 4096    1304
> 8192    1214
> 
> top hitter on of course stuff like clear_page_rep [k] and
> rep_movs_alternative [k] (that was with mitigations=on).

I think you're measuring something different than I was. I was purposefully
measuring a fully-cached workload, which that recipe achieves because I have
more than 32GB of RAM available. But I assume you're running this in a VM that
doesn't have that much, and thus you're actually benchmarking reading data
from disk and - probably more influential in this case - finding buffers to
put the newly read data into.
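
A quick way to check whether the file actually stays resident between runs is
something like the mmap() + mincore() sketch below (the default path is a
placeholder); IIRC util-linux also ships a fincore tool that reports the same
thing.

    /*
     * Rough sketch: report how much of a file is resident in the page
     * cache, via mmap() + mincore().
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char     *path = argc > 1 ? argv[1] : "/srv/dev/fio/test.0.0";
        int             fd = open(path, O_RDONLY);
        struct stat     st;
        long            page = sysconf(_SC_PAGESIZE);

        if (fd < 0 || fstat(fd, &st) < 0) { perror(path); return 1; }

        size_t          npages = (st.st_size + page - 1) / page;
        void           *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED,
                                   fd, 0);
        unsigned char  *vec = malloc(npages);

        if (map == MAP_FAILED || vec == NULL ||
            mincore(map, st.st_size, vec) < 0)
        {
            perror("mmap/mincore");
            return 1;
        }

        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;

        printf("%s: %zu of %zu pages resident (%.1f%%)\n",
               path, resident, npages, 100.0 * resident / npages);
        return 0;
    }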

Greetings,

Andres Freund

