On Tue, May 27, 2025 at 3:24 AM Eduard Stefes <eduard.ste...@ibm.com> wrote:
> So I worked on the algorithm to also work on buffers between 16-64
> bytes. Then I ran the performance measurement on two
> dataset[^raw_data_1] [^raw_data_2]. And created two diagrams
> [^attachment].
>
> my findings so far:
>
> - the optimized crc32cvx is faster
> - the sb8 performance is heavily depending on alignment (see the
> ripples every 8 bytes)

To be precise, these all seem 8-byte aligned at a glance, and the
ripple is due to input length.

> - the 8 byte ripple is also visible in the vx implementation. As it can
> only perform on 16 or 64 byte chunks, it will still use sb8 for the
> remaining bytes.
> - there is no obvious speed regression in the vx algorithm. Except
> raw_data_2-28 which I assume is a fluke. I am sharing the system with a
> bunch of other devs.
>
>
> I hope this is acceptable as performance measurement. However we
> will set up a dedicated performance test and try to get precise numbers
> without side-effects. But it may take some time until we get to that.

This already looks like a solid improvement at 32 bytes and above -- I
don't think we need less noisy numbers. Also for future reference,
please reply in-line. Thanks!

--
John Naylor
Amazon Web Services