Issue 171746
Summary Another case where AVX-512 mask moves are being used when they are not needed
Labels new issue
Assignees
Reporter Whatcookie
Hello, here is a case where AVX-512 generates worse code than AVX2. The Godbolt link below contains the LLVM IR that RPCS3 (a PlayStation 3 emulator) produces when emulating the SPU instructions GB and GBH. These instructions take the least significant bit of each element and place those bits at the end of a vector register. In effect they behave like pmovmsk*, with two differences. First, the result goes into a vector register instead of a GPR (SPUs have no GPRs). Second, they take the least significant bit of each element instead of the most significant bit.

For GB, using the mask registers is suboptimal because it adds an extra instruction: move to mask register, move to GPR, move to XMM.
The AVX2 path (znver3 in the link) is optimal because it only moves from XMM to GPR and back again.

For GBH, the existing AVX-512 output is ideal, since there is no pmovmskw. As an aside, RPCS3 uses a slick trick with gf2p8affineqb to emulate GBH in just two instructions: https://github.com/RPCS3/rpcs3/pull/14669


https://godbolt.org/z/d1zbaqEdq

In short:

AVX2 output (good)

        vmovmskps       eax, xmm0
        vmovd   xmm0, eax

AVX512 output (bad)

        vpmovd2m        k0, xmm0
        kmovb   eax, k0
        vmovd   xmm0, eax

Thanks!
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
