Changes from v2 to v3:

  * Unit testing.  This includes having x86 attempt all versions of
    the accelerator that will run on the hardware.  Thus an avx2 host
    will run the basic test 5 times (1.5sec on my laptop).  A sketch
    of the pattern follows this list.

  * Drop the ppc and aarch64 specializations.  I have improved the
    basic integer version to the point that those vectorized versions
    are not a win.  In the case of my aarch64 mustang, the integer
    version is 4 times faster than the neon version that I delete.
    With effort I was able to rewrite the neon version to come within
    a factor of 1.1, but it remained slower than the integer version.
    To be fair, gcc6 makes very good use of ldp, so the integer path
    is *also* loading 16 bytes per insn.

    I can forward my standalone aarch64 benchmark if anyone is
    interested.

    Note however that at least the avx2 acceleration is still very
    much a win, being about 3 times faster on my laptop.  Of course,
    it's handling 4 times as much data per loop as the integer
    version, so one can still see the overhead caused by using vector
    insns.  Sketches of both loops follow this list.

    For grins I wrote an avx512 version, in case someone has a
    skylake on which to test and benchmark.  That requires additional
    configure checks, so I didn't bother to include it here.
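Roughly, the test loop has the shape below.  This is a minimal sketch
of the pattern, not the code from the test patch; the table and
function names are illustrative only, and the real table is populated
according to what cpuid reports.

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    typedef bool (*accel_fn)(const void *, size_t);

    /* Stand-in for the plain integer version; the real table would
       also hold whichever of the sse2/sse4/avx2 variants the host
       supports, which is how an avx2 host ends up running the same
       test 5 times.  */
    static bool buffer_zero_byte(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        for (size_t i = 0; i < len; i++) {
            if (p[i]) {
                return false;
            }
        }
        return true;
    }

    static accel_fn accel_table[] = {
        buffer_zero_byte, /* sse2, sse4, avx2, ... */
    };

    int main(void)
    {
        size_t n = sizeof(accel_table) / sizeof(accel_table[0]);
        char buf[1024];

        for (size_t i = 0; i < n; i++) {
            memset(buf, 0, sizeof(buf));
            assert(accel_table[i](buf, sizeof(buf)));

            buf[sizeof(buf) - 1] = 1;   /* dirty one byte at the tail */
            assert(!accel_table[i](buf, sizeof(buf)));
        }
        return 0;
    }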
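The improved integer loop is essentially the following: OR several
words together per iteration and test the result.  This is a sketch
of the idea rather than a verbatim copy of the new
util/bufferiszero.c -- the unaligned head/tail handling and the
length cutoffs are omitted, and __builtin_prefetch stands in for the
generic prefetch added later in the series.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumes buf is 8-byte aligned and len is a multiple of
       4 * sizeof(uint64_t).  */
    static bool buffer_is_zero_int(const void *buf, size_t len)
    {
        const uint64_t *p = buf;
        const uint64_t *e = p + len / sizeof(uint64_t);

        for (; p < e; p += 4) {
            __builtin_prefetch(p + 4);
            /* Adjacent loads like these are exactly what gcc6 merges
               into ldp on aarch64, i.e. 16 bytes per insn, which is
               why this keeps up with the deleted neon version.  */
            if (p[0] | p[1] | p[2] | p[3]) {
                return false;
            }
        }
        return true;
    }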
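For comparison, the avx2 loop has the shape below, reading four
32-byte vectors per iteration against the integer loop's four 8-byte
words -- that's the "4 times as much data per loop" above.  Again a
sketch under simplifying assumptions (32-byte aligned input, length a
multiple of 128), not the patch itself; build with -mavx2.

    #include <immintrin.h>
    #include <stdbool.h>
    #include <stddef.h>

    static bool buffer_is_zero_avx2(const void *buf, size_t len)
    {
        const __m256i *p = buf;
        const __m256i *e = p + len / sizeof(__m256i);
        __m256i t = _mm256_setzero_si256();

        for (; p < e; p += 4) {
            /* 128 bytes per iteration.  */
            __m256i x = _mm256_or_si256(_mm256_load_si256(p),
                                        _mm256_load_si256(p + 1));
            __m256i y = _mm256_or_si256(_mm256_load_si256(p + 2),
                                        _mm256_load_si256(p + 3));
            t = _mm256_or_si256(t, _mm256_or_si256(x, y));
        }
        /* vptest sets ZF iff (t & t) == 0, i.e. no nonzero byte
           was accumulated.  */
        return _mm256_testz_si256(t, t);
    }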
r~


Richard Henderson (9):
  cutils: Move buffer_is_zero and subroutines to a new file
  cutils: Remove SPLAT macro
  cutils: Export only buffer_is_zero
  cutils: Rearrange buffer_is_zero acceleration
  cutils: Add test for buffer_is_zero
  cutils: Add generic prefetch
  cutils: Rewrite x86 buffer zero checking
  cutils: Remove aarch64 buffer zero checking
  cutils: Remove ppc buffer zero checking

 configure                 |  21 +--
 include/qemu/cutils.h     |   3 +-
 migration/ram.c           |   2 +-
 migration/rdma.c          |   5 +-
 tests/Makefile.include    |   3 +
 tests/test-bufferiszero.c |  78 +++++++++++
 util/Makefile.objs        |   1 +
 util/bufferiszero.c       | 332 ++++++++++++++++++++++++++++++++++++++++++++++
 util/cutils.c             | 244 ----------------------------------
 9 files changed, 423 insertions(+), 266 deletions(-)
 create mode 100644 tests/test-bufferiszero.c
 create mode 100644 util/bufferiszero.c

-- 
2.7.4