On Wed, 7 Aug 2024, Richard Biener wrote:
> OK with that change.
>
> Did you think about a AVX512 version (possibly with 32 byte vectors)?
> In case there's a more efficient variant of pshufb/pmovmskb available
> there - possibly the load on the branch unit could be lessened with
> using masking.

Thanks for the idea; unfortunately I don't see any possible improvement.
It would trade pmovmskb-(test+jcc,fused) for ktest-jcc, so unless the
latencies are shorter it seems to be a wash. The only way to use fewer
branches seems to be employing longer vectors.

(in any case I don't have access to a capable CPU to see for myself)

Alexander