On Thu, 15 Feb 2024, Richard Henderson wrote:
> On 2/15/24 13:37, Alexander Monakov wrote:
> > Ah, I guess you might be running at a low perf_event_paranoid setting
> > that allows unprivileged sampling of kernel events? In our submissions
> > the percentage was for perf_event_paranoid=2, i.e. relative to Qemu
> > only, excluding kernel time under syscalls.
>
> Ok. Eliminating kernel samples makes things easier to see.
> But I still do not see a 40% reduction in runtime.

I suspect Mikhail's image was less sparse, so the impact from inlining was
greater.

> With this, I see virtually all of the runtime in libz.so.
> Therefore I converted this to raw first, to focus on the issue.

Ah, apologies for that. I built with --disable-default-features and did not
notice that my qemu-img lacked vmdk support and treated the image as raw
instead. I assumed it was similar to what Mikhail used, but due to the
compression it obviously is not.

> For avoidance of doubt:
>
> $ ls -lsh test.raw && sha256sum test.raw
> 12G -rw-r--r-- 1 rth rth 40G Feb 15 21:14 test.raw
> 3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80  test.raw
>
> Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
> Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
>
> master:
>   57.48%  qemu-img-m   [.] buffer_zero_avx2
>    3.60%  qemu-img-m   [.] is_allocated_sectors.part.0
>    2.61%  qemu-img-m   [.] buffer_is_zero
>   63.69%  -- total
>
> v3:
>   48.86%  qemu-img-v3  [.] is_allocated_sectors.part.0
>    3.79%  qemu-img-v3  [.] buffer_zero_avx2
>   52.65%  -- total
>   -17%    -- reduction from master
>
> v4:
>   54.60%  qemu-img-v4  [.] buffer_is_zero_ge256
>    3.30%  qemu-img-v4  [.] buffer_zero_avx2
>    3.17%  qemu-img-v4  [.] is_allocated_sectors.part.0
>   61.07%  -- total
>    -4%    -- reduction from master
>
> v4+:
>   46.65%  qemu-img     [.] is_allocated_sectors.part.0
>    3.49%  qemu-img     [.] buffer_zero_avx2
>    0.05%  qemu-img     [.] buffer_is_zero_ge256
>   50.19%  -- total
>   -21%    -- reduction from master

Any ideas where the difference between v4+'s -21% and v3's -17% comes from?

FWIW, in situations like these I always recommend running perf with a fixed
sampling period, i.e. 'perf record -e cycles:P -c 100000' or
'perf record -e cycles/period=100000/P', so that sample counts between runs
of different duration are directly comparable (displayed with
'perf report -n').

> The v4+ puts the 3-byte test back inline, like in your v3.
>
> Importantly, it must be done as 3 short-circuiting tests, where my v4
> "simplified" this to (s | m | e) != 0, on the assumption that the reduced
> number of branches would help.

Yes, we also noticed that when preparing our patch. We tried mixed variants
such as (s | e) != 0 || m != 0 as well, but they did not turn out faster.
(A sketch of the two main variants is at the end of this mail.)

> With that settled, I guess we need to talk about how much the out-of-line
> implementation matters at all. I'm thinking about writing a
> test/bench/bufferiszero, with all-zero buffers of various sizes and
> alignments. With that it would be easier to talk about whether any given
> implementation is an improvement for that final 4% not eliminated by the
> three bytes.

Yeah, initially I suggested this task to Mikhail as a practice exercise
outside of Qemu, and we had a benchmark that measures buffer_is_zero via
perf_event_open. That makes it possible to see exactly how close the
implementation runs to the performance ceiling given by the maximum L1 fetch
rate (two loads per cycle on x86). A minimal sketch of that kind of harness
is at the end of this mail as well.

Alexander
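P.S. For concreteness, here is roughly what the two inline prefilter
variants compared above look like. This is an illustrative sketch, not the
exact code from any of the patches (the function names and the order of the
three byte tests are mine, and both variants assume len > 0), with a trivial
scalar stand-in for the out-of-line buffer_is_zero:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Trivial stand-in for the out-of-line implementation. */
static bool buffer_is_zero(const void *vbuf, size_t len)
{
    const uint8_t *p = vbuf;

    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Variant A: three short-circuiting tests of the start, middle and end
 * bytes (as in v3 and v4+). On typical non-zero data the first load
 * already rejects the buffer, so only one load and one branch execute. */
static bool is_zero_shortcircuit(const uint8_t *buf, size_t len)
{
    return buf[0] == 0 && buf[len / 2] == 0 && buf[len - 1] == 0 &&
           buffer_is_zero(buf, len);
}

/* Variant B: the v4 "simplification" with a single branch on the OR of
 * the three bytes. Fewer branches, but all three loads must complete
 * before the branch can resolve, and it measured slower. */
static bool is_zero_ored(const uint8_t *buf, size_t len)
{
    uint8_t s = buf[0], m = buf[len / 2], e = buf[len - 1];

    return (s | m | e) == 0 && buffer_is_zero(buf, len);
}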
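And this is the shape of the perf_event_open harness: a minimal sketch with
only basic error handling, assuming the buffer_is_zero implementation under
test is linked in separately. Counting user-space cycles only
(exclude_kernel) keeps it working under perf_event_paranoid=2:

#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The implementation under test, linked in separately. */
bool buffer_is_zero(const void *buf, size_t len);

/* Per-thread hardware cycle counter restricted to user space. */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    enum { LEN = 4096, ITERS = 100000 };
    static uint8_t buf[LEN]; /* all-zero buffer: every byte must be read */
    uint64_t cycles;
    int fd = open_cycle_counter();

    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < ITERS; i++) {
        if (!buffer_is_zero(buf, LEN)) {
            abort();
        }
    }
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &cycles, sizeof(cycles)) != (ssize_t)sizeof(cycles)) {
        return 1;
    }

    /* With two loads per cycle and 32-byte AVX2 vectors, 4096 bytes take
     * at least 64 cycles to fetch from L1, i.e. at most 64 bytes/cycle. */
    printf("%.1f cycles/call, %.2f bytes/cycle\n",
           (double)cycles / ITERS, (double)LEN * ITERS / (double)cycles);
    return 0;
}

Dividing bytes by cycles shows directly how close a given variant comes to
the two-loads-per-cycle L1 fetch ceiling.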