On Thu, 15 Feb 2024, Richard Henderson wrote:
> On 2/15/24 13:37, Alexander Monakov wrote:
> > Ah, I guess you might be running at a low perf_event_paranoid setting
> > that allows unprivileged sampling of kernel events? In our submissions
> > the percentage was for perf_event_paranoid=2, i.e. relative to Qemu
> > only, excluding kernel time under syscalls.
>
> Ok. Eliminating kernel samples makes things easier to see.
> But I still do not see a 40% reduction in runtime.

I suspect Mikhail's image was less sparse, so the impact from inlining was
greater.

> With this, I see virtually all of the runtime in libz.so.
> Therefore I converted this to raw first, to focus on the issue.

Ah, apologies for that. I built with --disable-default-features and did not
notice that my qemu-img lacked vmdk support and treated the image as raw
instead. I assumed it was similar to what Mikhail used, but due to the
compression it obviously is not.

> For avoidance of doubt:
>
> $ ls -lsh test.raw && sha256sum test.raw
> 12G -rw-r--r-- 1 rth rth 40G Feb 15 21:14 test.raw
> 3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80  test.raw
>
> Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
> Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
>
> master:
>   57.48%  qemu-img-m   [.] buffer_zero_avx2
>    3.60%  qemu-img-m   [.] is_allocated_sectors.part.0
>    2.61%  qemu-img-m   [.] buffer_is_zero
>   63.69%  -- total
>
> v3:
>   48.86%  qemu-img-v3  [.] is_allocated_sectors.part.0
>    3.79%  qemu-img-v3  [.] buffer_zero_avx2
>   52.65%  -- total
>   -17%    -- reduction from master
>
> v4:
>   54.60%  qemu-img-v4  [.] buffer_is_zero_ge256
>    3.30%  qemu-img-v4  [.] buffer_zero_avx2
>    3.17%  qemu-img-v4  [.] is_allocated_sectors.part.0
>   61.07%  -- total
>    -4%    -- reduction from master
>
> v4+:
>   46.65%  qemu-img     [.] is_allocated_sectors.part.0
>    3.49%  qemu-img     [.] buffer_zero_avx2
>    0.05%  qemu-img     [.] buffer_is_zero_ge256
>   50.19%  -- total
>   -21%    -- reduction from master

Any ideas where the difference between v4+'s -21% and v3's -17% comes from?

FWIW, in situations like these I always recommend running perf with a fixed
sampling period, i.e. 'perf record -e cycles:P -c 100000' or
'perf record -e cycles/period=100000/P', so that sample counts between runs
of different duration are directly comparable (displayed with
'perf report -n').

> The v4+ puts the 3-byte test back inline, like in your v3.
>
> Importantly, it must be done as 3 short-circuiting tests, where my v4
> "simplified" this to (s | m | e) != 0, on the assumption that the reduced
> number of branches would help.

Yes, we also noticed that when preparing our patch. We tried mixed variants
such as (s | e) != 0 || m != 0 as well, but they did not turn out faster.
(A sketch of the two main variants is at the end of this mail.)

> With that settled, I guess we need to talk about how much the out-of-line
> implementation matters at all. I'm thinking about writing a
> test/bench/bufferiszero, with all-zero buffers of various sizes and
> alignments. With that it would be easier to talk about whether any given
> implementation is an improvement for that final 4% not eliminated by the
> three bytes.

Yeah, initially I suggested this task to Mikhail as a practice exercise
outside of Qemu, and we had a benchmark that measures buffer_is_zero via
perf_event_open. That makes it possible to see exactly how close the
implementation runs to the performance ceiling given by the maximum L1 fetch
rate (two loads per cycle on x86). A minimal sketch of that kind of harness
is at the end of this mail as well.

Alexander
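P.S. For concreteness, here is roughly what the two inline prefilter
variants compared above look like. This is an illustrative sketch, not the
exact code from any of the patches (the function names and the order of the
three byte tests are mine, and both variants assume len > 0), with a trivial
scalar stand-in for the out-of-line buffer_is_zero:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Trivial stand-in for the out-of-line implementation. */
static bool buffer_is_zero(const void *vbuf, size_t len)
{
    const uint8_t *p = vbuf;

    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Variant A: three short-circuiting tests of the start, middle and end
 * bytes (as in v3 and v4+). On typical non-zero data the first load
 * already rejects the buffer, so only one load and one branch execute. */
static bool is_zero_shortcircuit(const uint8_t *buf, size_t len)
{
    return buf[0] == 0 && buf[len / 2] == 0 && buf[len - 1] == 0 &&
           buffer_is_zero(buf, len);
}

/* Variant B: the v4 "simplification" with a single branch on the OR of
 * the three bytes. Fewer branches, but all three loads must complete
 * before the branch can resolve, and it measured slower. */
static bool is_zero_ored(const uint8_t *buf, size_t len)
{
    uint8_t s = buf[0], m = buf[len / 2], e = buf[len - 1];

    return (s | m | e) == 0 && buffer_is_zero(buf, len);
}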
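And this is the shape of the perf_event_open harness: a minimal sketch with
only basic error handling, assuming the buffer_is_zero implementation under
test is linked in separately. Counting user-space cycles only
(exclude_kernel) keeps it working under perf_event_paranoid=2:

#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The implementation under test, linked in separately. */
bool buffer_is_zero(const void *buf, size_t len);

/* Per-thread hardware cycle counter restricted to user space. */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    enum { LEN = 4096, ITERS = 100000 };
    static uint8_t buf[LEN]; /* all-zero buffer: every byte must be read */
    uint64_t cycles;
    int fd = open_cycle_counter();

    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < ITERS; i++) {
        if (!buffer_is_zero(buf, LEN)) {
            abort();
        }
    }
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &cycles, sizeof(cycles)) != (ssize_t)sizeof(cycles)) {
        return 1;
    }

    /* With two loads per cycle and 32-byte AVX2 vectors, 4096 bytes take
     * at least 64 cycles to fetch from L1, i.e. at most 64 bytes/cycle. */
    printf("%.1f cycles/call, %.2f bytes/cycle\n",
           (double)cycles / ITERS, (double)LEN * ITERS / (double)cycles);
    return 0;
}

Dividing bytes by cycles shows directly how close a given variant comes to
the two-loads-per-cycle L1 fetch ceiling.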