On 07/04/2016 14:54, Michael S. Tsirkin wrote:
> char check_zero(char *p, int len)
> {
>     char res = 0;
>     int i;
>
>     for (i = 0; i < len; i++) {
>         res = res | p[i];
>     }
>
>     return res;
> }
>
> If you compile this function with -ftree-vectorize and -funroll-loops.
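For readers unfamiliar with the instruction names that come up below: the pcmpeq/movmsk pair corresponds to the SSE2 intrinsics _mm_cmpeq_epi8/_mm_movemask_epi8. Here is a hedged sketch (not QEMU's actual buffer_is_zero, just an illustration of the idea) of a compare-to-zero loop built on them:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Illustrative sketch only: check 16 bytes at a time.
 * Assumes len is a multiple of 16; uses unaligned loads so the
 * buffer need not be 16-byte aligned. */
static int all_zero_sse2(const char *p, size_t len)
{
    const __m128i zero = _mm_setzero_si128();
    size_t i;

    for (i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        /* pcmpeqb: each byte lane becomes 0xFF where v equals zero */
        __m128i cmp = _mm_cmpeq_epi8(v, zero);
        /* pmovmskb: collect the 16 lane MSBs into an int */
        if (_mm_movemask_epi8(cmp) != 0xFFFF) {
            return 0;   /* found a nonzero byte */
        }
    }
    return 1;
}
```

Unrolling this loop 32 times is what yields the "32 times pcmpeq, movmsk, cmp, je" pattern mentioned below.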
What you get then is exactly the same as what we already have in QEMU,
except for:

- the QEMU one has 128 extra instructions (32 times pcmpeq, movmsk, cmp,
  je) in the loop.  Those extra instructions are probably free because,
  in the case where the function goes through the whole buffer, the
  cache misses dominate despite the efforts of the hardware prefetcher;

- the QEMU one has an extra small loop at the beginning that proceeds
  one word at a time, to catch the case where almost everything in the
  page is nonzero.

> Now, this version always scans all of the buffer, so
> it will be slower when buffer is *not* all-zeroes.

This is by far the common case.

> Which might indicate that you need to know what your
> workload is to implement compare to zero efficiently,

Not necessarily.  The two cases (unrolled/higher setup cost, and
non-unrolled/lower setup cost) are the same as the "parallel" and
"sequential" parts in Amdahl's law, and they optimize for completely
opposite workloads.  Amdahl's law then tells you that, by making the
non-unrolled part small enough, you can get very close to the absolute
maximum speedup.

Now of course, if you know that your workload is "almost everything is
zero except a few bytes at the end of the page", then you have the
problem that your workload sucks, and you should hate the guy who wrote
the software running in the guest. :)

Paolo
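To make the Amdahl's-law point concrete, here is a hedged sketch (again, not QEMU's actual code, and the unroll factor of 4 is arbitrary) of the two-phase structure: a cheap word-at-a-time prefix loop that exits early when the buffer is nonzero near the start, followed by an unrolled bulk loop that is optimized for the all-zero case:

```c
#include <stddef.h>

/* Illustrative sketch of the two-phase structure discussed above.
 * Assumes len is a multiple of sizeof(unsigned long). */
static int buffer_is_zero_sketch(const void *buf, size_t len)
{
    const unsigned long *p = buf;
    size_t words = len / sizeof(unsigned long);
    size_t i;

    /* Phase 1: the "sequential" part -- small setup loop, cheap to
     * bail out of when the buffer is nonzero early on. */
    for (i = 0; i < 4 && i < words; i++) {
        if (p[i]) {
            return 0;
        }
    }

    /* Phase 2: the "parallel" part -- unrolled, optimized for the
     * all-zero buffer that scans to the end. */
    for (; i + 4 <= words; i += 4) {
        if (p[i] | p[i + 1] | p[i + 2] | p[i + 3]) {
            return 0;
        }
    }

    /* Remaining words, if any. */
    for (; i < words; i++) {
        if (p[i]) {
            return 0;
        }
    }
    return 1;
}
```

The larger phase 2 is relative to phase 1, the closer the all-zero case gets to the speedup of the unrolled loop alone, which is the Amdahl's-law argument above.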