https://bugs.llvm.org/show_bug.cgi?id=52394
Bug ID: 52394
Summary: [aarch64] Inappropriate optimization: vtstq NEON
intrinsic compiled as a sequence of instructions
Product: new-bugs
Version: trunk
Hardware: PC
OS: Windows NT
Status: NEW
Severity: enhancement
Priority: P
Component: new bugs
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected], [email protected]
In some cases clang compiles vtstq intrinsic as a sequence of and/cmeq
instructions, instead of just a single cmtst.
For example:
#include <arm_neon.h>
uint32x4_t foo(uint32x4_t v1, uint32x4_t v2, uint32x4_t v3, uint32x4_t v4)
{
return vbslq_u32(vtstq_u32(v1, v2), v3, v4);
}
compiles (with -O2 or -Os or even -Oz) to:
and v0.16b, v1.16b, v0.16b
cmeq v0.4s, v0.4s, #0
bsl v0.16b, v3.16b, v2.16b
ret
The reason for this choice is unclear - AFAIK, cmtst throughput/latency is
similar to cmeq's.
Anyway, my benchmarks indicate a significant performance degradation for this
reason. The benchmarked case is an unrolled loop comprised mostly of vbslq and
vtstq operations.
Both GCC and MSVC compile the code above as expected:
cmtst v0.4s, v0.4s, v1.4s
bsl v0.16b, v2.16b, v3.16b
ret