On 07/04/2016 14:54, Michael S. Tsirkin wrote:
> char check_zero(char *p, int len)
> {
>     char res = 0;
>     int i;
>
>     for (i = 0; i < len; i++) {
>         res = res | p[i];
>     }
>
>     return res;
> }
>
> If you compile this function with -ftree-vectorize and -funroll-loops.
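For readers unfamiliar with the instruction names that come up below: the pcmpeq/movmsk pair corresponds to the SSE2 intrinsics _mm_cmpeq_epi8/_mm_movemask_epi8. Here is a hedged sketch (not QEMU's actual buffer_is_zero, just an illustration of the idea) of a compare-to-zero loop built on them:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Illustrative sketch only: check 16 bytes at a time.
 * Assumes len is a multiple of 16; uses unaligned loads so the
 * buffer need not be 16-byte aligned. */
static int all_zero_sse2(const char *p, size_t len)
{
    const __m128i zero = _mm_setzero_si128();
    size_t i;

    for (i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        /* pcmpeqb: each byte lane becomes 0xFF where v equals zero */
        __m128i cmp = _mm_cmpeq_epi8(v, zero);
        /* pmovmskb: collect the 16 lane MSBs into an int */
        if (_mm_movemask_epi8(cmp) != 0xFFFF) {
            return 0;   /* found a nonzero byte */
        }
    }
    return 1;
}
```

Unrolling this loop 32 times is what yields the "32 times pcmpeq, movmsk, cmp, je" pattern mentioned below.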
What you get then is exactly the same as what we already have in QEMU,
except for:

- the QEMU one has 128 extra instructions (32 times pcmpeq, movmsk, cmp,
  je) in the loop.  Those extra instructions are probably free because,
  in the case where the function goes through the whole buffer, the
  cache misses dominate despite the efforts of the hardware prefetcher;

- the QEMU one has an extra small loop at the beginning that proceeds
  one word at a time, to catch the case where almost everything in the
  page is nonzero.

> Now, this version always scans all of the buffer, so
> it will be slower when buffer is *not* all-zeroes.

This is by far the common case.

> Which might indicate that you need to know what your
> workload is to implement compare to zero efficiently,

Not necessarily.  The two cases (unrolled/higher setup cost, and
non-unrolled/lower setup cost) are the same as the "parallel" and
"sequential" parts in Amdahl's law, and they optimize for completely
opposite workloads.  Amdahl's law then tells you that, by making the
non-unrolled part small enough, you can get very close to the absolute
maximum speedup.

Now of course, if you know that your workload is "almost everything is
zero except a few bytes at the end of the page", then you have the
problem that your workload sucks, and you should hate the guy who wrote
the software running in the guest. :)

Paolo
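To make the Amdahl's-law point concrete, here is a hedged sketch (again, not QEMU's actual code, and the unroll factor of 4 is arbitrary) of the two-phase structure: a cheap word-at-a-time prefix loop that exits early when the buffer is nonzero near the start, followed by an unrolled bulk loop that is optimized for the all-zero case:

```c
#include <stddef.h>

/* Illustrative sketch of the two-phase structure discussed above.
 * Assumes len is a multiple of sizeof(unsigned long). */
static int buffer_is_zero_sketch(const void *buf, size_t len)
{
    const unsigned long *p = buf;
    size_t words = len / sizeof(unsigned long);
    size_t i;

    /* Phase 1: the "sequential" part -- small setup loop, cheap to
     * bail out of when the buffer is nonzero early on. */
    for (i = 0; i < 4 && i < words; i++) {
        if (p[i]) {
            return 0;
        }
    }

    /* Phase 2: the "parallel" part -- unrolled, optimized for the
     * all-zero buffer that scans to the end. */
    for (; i + 4 <= words; i += 4) {
        if (p[i] | p[i + 1] | p[i + 2] | p[i + 3]) {
            return 0;
        }
    }

    /* Remaining words, if any. */
    for (; i < words; i++) {
        if (p[i]) {
            return 0;
        }
    }
    return 1;
}
```

The larger phase 2 is relative to phase 1, the closer the all-zero case gets to the speedup of the unrolled loop alone, which is the Amdahl's-law argument above.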