On Wed, 14 Feb 2024, Richard Henderson wrote:
> v3: https://patchew.org/QEMU/20240206204809.9859-1-amona...@ispras.ru/
>
> Changes for v4:
>   - Keep separate >= 256 entry point, but only keep constant length
>     check inline.  This allows the indirect function call to be hidden
>     and optimized away when the pointer is constant.

Sorry, I don't understand this. Most of the improvement (at least in our
testing) comes from inlining the byte checks, which often fail and
eliminate call overhead entirely. Moving them out-of-line seems to lose
most of the speedup the patchset was bringing, doesn't it? Is there some
concern I am not seeing?

>   - Split out a >= 256 integer routine.
>   - Simplify acceleration selection for testing.
>   - Add function pointer typedef.
>   - Implement new aarch64 accelerations.
>
> r~
>
> Alexander Monakov (5):
>   util/bufferiszero: Remove SSE4.1 variant
>   util/bufferiszero: Remove AVX512 variant
>   util/bufferiszero: Reorganize for early test for acceleration
>   util/bufferiszero: Remove useless prefetches
>   util/bufferiszero: Optimize SSE2 and AVX2 variants
>
> Richard Henderson (5):
>   util/bufferiszero: Improve scalar variant
>   util/bufferiszero: Introduce biz_accel_fn typedef
>   util/bufferiszero: Simplify test_buffer_is_zero_next_accel
>   util/bufferiszero: Add simd acceleration for aarch64
>   util/bufferiszero: Add sve acceleration for aarch64
>
>  host/include/aarch64/host/cpuinfo.h |   1 +
>  include/qemu/cutils.h               |  15 +-
>  util/bufferiszero.c                 | 500 ++++++++++++++++------------
>  util/cpuinfo-aarch64.c              |   1 +
>  meson.build                         |  13 +
>  5 files changed, 323 insertions(+), 207 deletions(-)