On Wed, 14 Feb 2024, Richard Henderson wrote:

> v3: https://patchew.org/QEMU/20240206204809.9859-1-amona...@ispras.ru/
> 
> Changes for v4:
>   - Keep separate >= 256 entry point, but only keep constant length
>     check inline.  This allows the indirect function call to be hidden
>     and optimized away when the pointer is constant.

Sorry, I don't understand this. Most of the improvement (at least in our
testing) comes from inlining the byte checks, which usually find a nonzero
byte and eliminate the call overhead entirely. Moving them out-of-line
seems to lose most of the speedup the patchset was bringing, doesn't it?
Is there some concern I am not seeing?

>   - Split out a >= 256 integer routine.
>   - Simplify acceleration selection for testing.
>   - Add function pointer typedef.
>   - Implement new aarch64 accelerations.
> 
> 
> r~
> 
> 
> Alexander Monakov (5):
>   util/bufferiszero: Remove SSE4.1 variant
>   util/bufferiszero: Remove AVX512 variant
>   util/bufferiszero: Reorganize for early test for acceleration
>   util/bufferiszero: Remove useless prefetches
>   util/bufferiszero: Optimize SSE2 and AVX2 variants
> 
> Richard Henderson (5):
>   util/bufferiszero: Improve scalar variant
>   util/bufferiszero: Introduce biz_accel_fn typedef
>   util/bufferiszero: Simplify test_buffer_is_zero_next_accel
>   util/bufferiszero: Add simd acceleration for aarch64
>   util/bufferiszero: Add sve acceleration for aarch64
> 
>  host/include/aarch64/host/cpuinfo.h |   1 +
>  include/qemu/cutils.h               |  15 +-
>  util/bufferiszero.c                 | 500 ++++++++++++++++------------
>  util/cpuinfo-aarch64.c              |   1 +
>  meson.build                         |  13 +
>  5 files changed, 323 insertions(+), 207 deletions(-)
> 
> 
