On Wed, 7 Aug 2024, Richard Biener wrote:

> OK with that change.
> 
> Did you think about a AVX512 version (possibly with 32 byte vectors)?
> In case there's a more efficient variant of pshufb/pmovmskb available
> there - possibly the load on the branch unit could be lessened with
> using masking.

Thanks for the idea; unfortunately I don't see any possible improvement.
It would trade pmovmskb followed by a macro-fused test+jcc for ktest
followed by jcc, so unless the latencies are shorter it seems to be a
wash. The only way to use fewer branches seems to be employing longer
vectors.
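
To make the branch count concrete, here is a rough sketch of the loop shape
in question. It is not the code from the patch: find_newline is a
hypothetical helper, it uses a plain pcmpeqb against '\n' instead of the
pshufb-based lookup, and it assumes GCC or Clang for __builtin_ctz. Without
SSE2 it falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper, not the patch's code: return the offset of the
   first '\n' in buf[0..len), or len if there is none.  */
static size_t
find_newline (const char *buf, size_t len)
{
  size_t i = 0;
#if defined(__SSE2__)
  const __m128i nl = _mm_set1_epi8 ('\n');
  for (; i + 16 <= len; i += 16)
    {
      __m128i v = _mm_loadu_si128 ((const __m128i *) (buf + i));
      /* pcmpeqb + pmovmskb: one bit per matching byte.  */
      unsigned m = (unsigned) _mm_movemask_epi8 (_mm_cmpeq_epi8 (v, nl));
      /* test + jcc: the single data-dependent branch per 16-byte chunk.  */
      if (m)
        return i + (size_t) __builtin_ctz (m);
    }
#endif
  /* Scalar tail (and the whole search when SSE2 is unavailable).  */
  for (; i < len; i++)
    if (buf[i] == '\n')
      return i;
  return len;
}
```

With AVX512BW/VL the compare could write a mask register directly (for
example via _mm256_cmpeq_epi8_mask), but the loop would still test that mask
and branch once per chunk; only a wider chunk reduces the number of branches
per byte scanned.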

(In any case, I don't have access to an AVX512-capable CPU to see for myself.)

Alexander
