[llvm-bugs] [Bug 26110] clang c compiler produces wrong result for the attached c code with -O2 optimzation

via llvm-bugs Wed, 10 Feb 2016 17:20:06 -0800

https://llvm.org/bugs/show_bug.cgi?id=26110


Ahmed Bougacha <ahmed.bouga...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
                 CC|                            |ahmed.bouga...@gmail.com
          Component|LLVM Codegen                |Backend: X86
            Version|3.7                         |trunk
         Resolution|FIXED                       |---
           Assignee|unassignedclangbugs@nondot. |unassignedb...@nondot.org
                   |org                         |
            Product|clang                       |libraries

--- Comment #3 from Ahmed Bougacha <ahmed.bouga...@gmail.com> ---
So; I looked a little closer. Sanjay's bisect was correct. clang-700 is pretty
old now; I bisected to:
  r229099 [SimplifyCFG] Be more aggressive

Sure enough, this still reproduces on trunk with -mllvm
-phi-node-folding-threshold=1.

Long story short: the problematic pattern is:
  (c ? -v : v)

which we lower to (because "c" is <4 x i1>, lowered as a vector mask):
  (~c & v) | (c & -v)

roughly corresponding to this IR:
  define <4 x i32> @t(<4 x i32> %v, <4 x i32> %c) {
    %cl = shl <4 x i32> %c, <i32 31, i32 31, i32 31, i32 31>
    %cs = ashr <4 x i32> %cl, <i32 31, i32 31, i32 31, i32 31>
    %tmp2 = trunc <4 x i32> %cs to <4 x i1>
    ; ^ not as artificial as it looks: equivalent to a legalized vsetcc

    %mv = sub nsw <4 x i32> zeroinitializer, %v
    %r = select <4 x i1> %tmp2, <4 x i32> %v, <4 x i32> %mv
    ret <4 x i32> %r
  }


The SSE2 codegen is pretty straightforward:

    xorps  %xmm1, %xmm1
    ...                   # xmm6 <- %v
    ...                   # xmm3 <- %c
    psubd  %xmm6, %xmm1   # 0 - v                # 0 - 5 -> -5
    movaps %xmm3, %xmm0   # c                    # 0 -> 0
    pandn  %xmm6, %xmm0   # ~c & v               # ~0 & 5 -> 5
    pand   %xmm3, %xmm1   # c & -v               # -5 & 0 -> 0
    por    %xmm0, %xmm1   # (~c & v) | (c & -v)  # 0 | 5 -> 5
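
As a sanity check on the sequence above, here is a minimal Python sketch of
one 32-bit lane. The helper names (to_signed, sse2_blend) are made up for
illustration, and c is assumed to be a proper mask lane (all-zeros or
all-ones), as in the trace.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def sse2_blend(c, v):
    """One lane of the SSE2 sequence; c is assumed to be 0 or 0xFFFFFFFF."""
    neg_v = (0 - v) & MASK32       # psubd: 0 - v
    keep = (~c & MASK32) & v       # pandn: ~c & v
    flip = c & neg_v               # pand:  c & -v
    return keep | flip             # por:   (~c & v) | (c & -v)

print(to_signed(sse2_blend(0x00000000, 5)))  # 5
print(to_signed(sse2_blend(0xFFFFFFFF, 5)))  # -5
```

With a zero mask the lane keeps v, which is the behavior the PSIGND version
loses.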

However, when we have SSSE3 (the default on OS X), we try to match it to PSIGND
instead, doing:

    psignd    %xmm3, %xmm1    # (c < 0 ? -v : (c > 0 ? v : 0))
                              #   c is a mask, so (c > 0) == 0
                              # (c ? -v : 0)
                              # (0 ? -5 : 0)
                              #   -> 0

Which is not equivalent; one does:
  (c ? -v : 0)
the other:
  (c ? -v : v)
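
To make the mismatch concrete, a small Python sketch comparing the two on a
single lane. psignd_lane and select_lane are hypothetical helper names; values
are plain signed Python ints, so the INT_MIN wraparound of the real instruction
is deliberately ignored.

```python
def psignd_lane(dst, src):
    """One PSIGND lane: negate, keep, or zero dst based on the sign of src."""
    if src < 0:
        return -dst
    if src > 0:
        return dst
    return 0

def select_lane(c, v):
    """The intended (c ? -v : v); c is a mask lane, -1 (true) or 0 (false)."""
    return -v if c != 0 else v

for c in (-1, 0):
    print(c, select_lane(c, 5), psignd_lane(5, c))
# -1 -5 -5
# 0 5 0
```

The two agree when the mask is all-ones and disagree when it is zero.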



Now, this bug has existed since 2010. However, I think we didn't know about
this issue because of operand canonicalization.

The PSIGN combine matches:
  (or (and m, x), (pandn m, (0 - x)))
  (or (and x, m), (pandn m, (0 - x)))
  (or (pandn m, (0 - x)), (and m, x))
  (or (pandn m, (0 - x)), (and x, m))

but not the variants of:
  (or (and m, (0 - x)), (pandn m, x))

Which is what gets generated for the function above (the most obvious IR that I
could write).



I think this is pretty easy to fix: instead of using c directly as the PSIGN
sign operand, set a non-sign bit in it, so that a zero mask defaults to the
'v' case.

So, this should work:
    por       <1,1,1,1>, %xmm3 # c' = c | 1
    psignd    %xmm3, %xmm1     # (c' < 0 ? -v : (c' > 0 ? v : 0))
                               #   c is a mask, so c' is either 1 or 0xff..f
                               # (c' == 0xff..f ? -v : (c' != 0 ? v : v))
                               # (c' == 0xff..f ? -v : v)
                               # (0 ? -5 : 5)
                               #   -> 5
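
A quick Python sketch of why OR-ing in bit 0 is enough, under the assumption
that c is a proper mask lane (-1 or 0). The helper names are hypothetical, and
plain Python ints stand in for 32-bit lanes (no INT_MIN wraparound modeled).

```python
def psignd_lane(dst, src):
    """One PSIGND lane: negate, keep, or zero dst based on the sign of src."""
    if src < 0:
        return -dst
    if src > 0:
        return dst
    return 0

def fixed_lane(c, v):
    """PSIGND with c' = c | 1, so the sign operand is never zero."""
    c_prime = c | 1            # -1 | 1 == -1, 0 | 1 == 1
    return psignd_lane(v, c_prime)

for c in (-1, 0):
    for v in (5, -5, 0):
        # Matches the intended (c ? -v : v) for both mask values.
        assert fixed_lane(c, v) == (-v if c else v)

print(fixed_lane(0, 5), fixed_lane(-1, 5))  # 5 -5
```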

Constant-pool loads are cheap, so this is probably still a win over the SSE2
codegen:

    psrad  $31, %xmm1
    pxor   %xmm2, %xmm2
    psubd  %xmm0, %xmm2
    pand   %xmm1, %xmm2
    pandn  %xmm0, %xmm1
    por    %xmm1, %xmm2
    movdqa %xmm2, %xmm0

Note that I don't think the couple of PSIGN tests in trunk are correct either.
Consider test/CodeGen/X86/vec-sign.ll:

define <4 x i32> @signd(<4 x i32> %a, <4 x i32> %b) nounwind {
entry:
  %b.lobit = ashr <4 x i32> %b, <i32 31, i32 31, i32 31, i32 31>
  %sub = sub nsw <4 x i32> zeroinitializer, %a
  %0 = xor <4 x i32> %b.lobit, <i32 -1, i32 -1, i32 -1, i32 -1>
  %1 = and <4 x i32> %a, %0
  %2 = and <4 x i32> %b.lobit, %sub
  %cond = or <4 x i32> %1, %2
  ret <4 x i32> %cond
}

if %b is zero:

  %b.lobit = <4 x i32> zeroinitializer
  %sub = sub nsw <4 x i32> zeroinitializer, %a
  %0 = <4 x i32> <i32 -1, i32 -1, i32 -1, i32 -1>
  %1 = <4 x i32> %a
  %2 = <4 x i32> zeroinitializer
  %cond = or <4 x i32> %1, %2
  ret <4 x i32> %a

whereas we currently generate:
  psignd %xmm1, %xmm0
  retq

which returns 0, as %xmm1 is 0.
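
The same comparison for the vec-sign.ll function, as a rough Python sketch of
one lane. signd_ir follows the IR line by line; psignd_lane models the
generated instruction. Both names are made up here, and inputs are assumed to
be in signed 32-bit range (overflow of -a is not modeled).

```python
def psignd_lane(dst, src):
    """One PSIGND lane: negate, keep, or zero dst based on the sign of src."""
    if src < 0:
        return -dst
    if src > 0:
        return dst
    return 0

def signd_ir(a, b):
    """One lane of @signd, following the IR above."""
    lobit = b >> 31                # %b.lobit: -1 if b < 0, else 0
    sub = -a                       # %sub = 0 - a
    not_lobit = ~lobit             # %0 = xor %b.lobit, -1
    return (a & not_lobit) | (lobit & sub)   # %cond = or %1, %2

for b in (-3, 0, 7):
    print(b, signd_ir(9, b), psignd_lane(9, b))
# -3 -9 -9
# 0 9 0
# 7 9 9
```

Only the b == 0 lane differs, which is the case the existing test does not
account for.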

