v3: https://patchew.org/QEMU/20240206204809.9859-1-amona...@ispras.ru/
v4: https://patchew.org/QEMU/20240215081449.848220-1-richard.hender...@linaro.org/

Changes for v5:
  - Move 3 byte sample back inline; document it.
  - Drop AArch64 SVE alternative; neoverse-v2 still recommends simd
    for memcpy.
  - Use UMAXV for aarch64 simd reduction: 3 cycles on cortex-a76,
    2 cycles on neoverse-n1, as compared to UQXTN or CMEQ+SHRN
    at 4 cycles each.
  - Add benchmark of zeros.

The benchmark is trivial, and could be improved so that it prints
the name of the acceleration routine instead of its index in the
selection process.  But it is good enough to see that #0 is faster
than #1, etc.

A sample set:

Apple M1:
  buffer_is_zero #0: 135416.27 MB/sec
  buffer_is_zero #1: 111771.25 MB/sec

Neoverse N1:
  buffer_is_zero #0: 56489.82 MB/sec
  buffer_is_zero #1: 36347.93 MB/sec

i7-1195G7:
  buffer_is_zero #0: 137327.40 MB/sec
  buffer_is_zero #1: 69159.20 MB/sec
  buffer_is_zero #2: 38319.80 MB/sec


r~


Alexander Monakov (5):
  util/bufferiszero: Remove SSE4.1 variant
  util/bufferiszero: Remove AVX512 variant
  util/bufferiszero: Reorganize for early test for acceleration
  util/bufferiszero: Remove useless prefetches
  util/bufferiszero: Optimize SSE2 and AVX2 variants

Richard Henderson (5):
  util/bufferiszero: Improve scalar variant
  util/bufferiszero: Introduce biz_accel_fn typedef
  util/bufferiszero: Simplify test_buffer_is_zero_next_accel
  util/bufferiszero: Add simd acceleration for aarch64
  tests/bench: Add bufferiszero-bench

 include/qemu/cutils.h            |  32 ++-
 tests/bench/bufferiszero-bench.c |  42 +++
 util/bufferiszero.c              | 449 +++++++++++++++++--------------
 tests/bench/meson.build          |   4 +-
 4 files changed, 319 insertions(+), 208 deletions(-)
 create mode 100644 tests/bench/bufferiszero-bench.c

-- 
2.34.1
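
For readers who want to see the shape of the measurement, the benchmark
idea described above (time each routine over a buffer of zeros and report
MB/sec) can be sketched roughly as below. This is not the code in
tests/bench/bufferiszero-bench.c and it does not use any QEMU API: the
zero_check_fn typedef, the two scalar routines, the impls[] table and the
bench_one()/bench_all() helpers are all hypothetical names made up for
illustration. It also assumes MB means 10^6 bytes, and it prints a routine
name rather than an index, which is the improvement suggested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for a zero-check routine (not QEMU's typedef). */
typedef bool (*zero_check_fn)(const void *buf, size_t len);

/* Reference routine: test one byte at a time, exit on first nonzero. */
static bool is_zero_bytes(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* OR together 8-byte chunks, then the tail bytes; zero iff all were zero. */
static bool is_zero_words(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t acc = 0;
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, sizeof(w));   /* alignment-safe load */
        acc |= w;
    }
    for (; i < len; i++) {
        acc |= p[i];
    }
    return acc == 0;
}

struct zero_check_impl {
    const char *name;
    zero_check_fn fn;
};

/* Illustrative table of routines to compare. */
static const struct zero_check_impl impls[] = {
    { "words", is_zero_words },
    { "bytes", is_zero_bytes },
};

/*
 * Time iters calls over buf and return throughput in MB/sec (10^6 bytes).
 * Returns 0.0 if the clock did not advance or a call reported nonzero data.
 */
static double bench_one(const struct zero_check_impl *impl,
                        const void *buf, size_t len, int iters)
{
    /* Volatile pointer keeps the compiler from hoisting the call. */
    zero_check_fn volatile fn = impl->fn;
    int zeros = 0;

    assert(len > 0 && iters > 0);
    clock_t start = clock();
    for (int i = 0; i < iters; i++) {
        zeros += fn(buf, len);
    }
    clock_t end = clock();

    double secs = (double)(end - start) / CLOCKS_PER_SEC;
    if (zeros != iters || secs <= 0.0) {
        return 0.0;
    }
    return (double)len * iters / 1e6 / secs;
}

/* Benchmark every routine on a zeroed buffer; returns count, or -1 on OOM. */
static int bench_all(size_t len, int iters)
{
    int n = (int)(sizeof(impls) / sizeof(impls[0]));
    void *buf = calloc(1, len);
    if (!buf) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        printf("buffer_is_zero %s: %.2f MB/sec\n",
               impls[i].name, bench_one(&impls[i], buf, len, iters));
    }
    free(buf);
    return n;
}
```

Calling bench_all() from a main() with a large buffer and enough
iterations for clock() to register should print one line per routine.
The buffer size and iteration count are left to the caller, since they
are not stated here.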