On Tue, May 27, 2025 at 3:24 AM Eduard Stefes <eduard.ste...@ibm.com> wrote:
> So I worked on the algorithm to also handle buffers of 16-64
> bytes. Then I ran the performance measurement on two
> datasets [^raw_data_1] [^raw_data_2] and created two diagrams
> [^attachment].
>
> my findings so far:
>
> - the optimized crc32cvx is faster
> - the sb8 performance depends heavily on alignment (see the
> ripples every 8 bytes)

To be precise, the inputs all seem to be 8-byte aligned at a glance, so
the ripple is due to input length rather than alignment.

> - the 8-byte ripple is also visible in the vx implementation. As it can
> only operate on 16- or 64-byte chunks, it will still use sb8 for the
> remaining bytes.
> - there is no obvious speed regression in the vx algorithm, except for
> raw_data_2-28, which I assume is a fluke. I am sharing the system with a
> bunch of other devs.
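
As a rough illustration of how input length alone can produce an 8-byte
ripple in both implementations, here is a minimal sketch in C. It is not
taken from the patch: the names crc_split and split_length are
hypothetical, and it assumes an already aligned buffer, a vector path
that consumes 64-byte blocks first and then 16-byte blocks, and an sb8
fallback that takes 8 bytes at a time and finishes byte by byte. Any
minimum-length threshold for entering the vector path is ignored.

```c
#include <stddef.h>

/* Hypothetical breakdown of one CRC call; illustration only. */
typedef struct
{
	size_t		vx64;		/* 64-byte blocks for the vector path */
	size_t		vx16;		/* 16-byte blocks for the vector path */
	size_t		sb8_words;	/* 8-byte words left for slicing-by-8 */
	size_t		tail_bytes;	/* final bytes handled one at a time */
} crc_split;

/* Split a length into the work each stage would see (assumed order). */
static crc_split
split_length(size_t len)
{
	crc_split	s;

	s.vx64 = len / 64;
	len %= 64;
	s.vx16 = len / 16;
	len %= 16;
	s.sb8_words = len / 8;
	s.tail_bytes = len % 8;
	return s;
}
```

Under these assumptions, a length of 44 splits into two 16-byte blocks,
one 8-byte word and four trailing bytes, while a length of 32 leaves no
tail at all. The byte-at-a-time tail grows from 0 to 7 and then resets,
which would be consistent with a ripple of period 8 on aligned input.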
>
>
> I hope this is acceptable as a performance measurement. However, we
> will set up a dedicated performance test and try to get precise numbers
> without side effects. But it may take some time until we get to that.

This already looks like a solid improvement at 32 bytes and above -- I
don't think we need less noisy numbers. Also for future reference,
please reply in-line. Thanks!

--
John Naylor
Amazon Web Services

