On Fri, 26 Jun 2020, Jeff Law via Gcc-patches wrote:

> In theory yes, but there are cases where paths converge (like you've
> shown) where you may have evaluated to a constant on the paths, but
> it's not a constant at the convergence point.  You have to be very
> careful using b_c_p like this and it's been a regular source of
> kernel bugs.
>
> I'd recommend looking at the .ssa dump and walking forward from there
> if the .ssa dump looks correct.

Here is the last dump before thread1 (105t.mergephi2). I don't see anything incorrect in it.

ledtrig_cpu (_Bool is_active)
{
  int old;
  int iftmp.0_1;
  int _5;

  <bb 2> [local count: 1073741824]:
  if (is_active_2(D) != 0)
    goto <bb 4>; [50.00%]
  else
    goto <bb 3>; [50.00%]

  <bb 3> [local count: 536870913]:

  <bb 4> [local count: 1073741824]:
  # iftmp.0_1 = PHI <1(2), -1(3)>
  _5 = __builtin_constant_p (iftmp.0_1);
  if (_5 != 0)
    goto <bb 5>; [50.00%]
  else
    goto <bb 8>; [50.00%]

  <bb 5> [local count: 536870913]:
  if (iftmp.0_1 >= -128)
    goto <bb 6>; [50.00%]
  else
    goto <bb 8>; [50.00%]

  <bb 6> [local count: 268435456]:
  if (iftmp.0_1 <= 127)
    goto <bb 7>; [34.00%]
  else
    goto <bb 8>; [66.00%]

  <bb 7> [local count: 91268056]:
  __asm__ __volatile__("asi %0,%1
" : "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "i" iftmp.0_1, "Q" MEM[(int *)&num_active_cpus] 
: "memory", "cc");
  goto <bb 9>; [100.00%]

  <bb 8> [local count: 982473769]:
  __asm__ __volatile__("laa %0,%2,%1
" : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "d" iftmp.0_1, "Q" MEM[(int 
*)&num_active_cpus] : "memory", "cc");

  <bb 9> [local count: 1073741824]:
  return;

}

There is a single b_c_p, the asm's immediate argument is exactly the argument of b_c_p, and the asm is in the branch protected by b_c_p.
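
For reference, the testcase boils down to something like the following C. This is reconstructed from the dump, not the actual kernel source, so take the names and exact shape as approximations (and the asm is s390-only):

static int num_active_cpus;

/* Sketch: use the add-immediate instruction when the value is a
   compile-time constant fitting in a signed byte, the register form
   otherwise.  The b_c_p guard is what keeps the "i" constraint
   satisfiable: the "asi" arm must be folded away whenever val is not
   a constant. */
static inline void atomic_add(int val, int *ptr)
{
    if (__builtin_constant_p(val) && val >= -128 && val <= 127)
        __asm__ __volatile__("asi %0,%1\n"
                             : [ptr] "=Q" (*ptr)
                             : [val] "i" (val), "Q" (*ptr)
                             : "memory", "cc");
    else {
        int old;
        __asm__ __volatile__("laa %0,%2,%1\n"
                             : [old] "=d" (old), [ptr] "=Q" (*ptr)
                             : [val] "d" (val), "Q" (*ptr)
                             : "memory", "cc");
    }
}

void ledtrig_cpu(_Bool is_active)
{
    atomic_add(is_active ? 1 : -1, &num_active_cpus);
}

The PHI <1(2), -1(3)> in bb 4 is the ternary, and bbs 5 and 6 are the two range checks.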

Now the thread1 dump, for comparison:

ledtrig_cpu (_Bool is_active)
{
  int old;
  int iftmp.0_4;
  int iftmp.0_6;
  int _7;
  int _12;
  int iftmp.0_13;
  int iftmp.0_14;

  <bb 2> [local count: 1073741824]:
  if (is_active_2(D) != 0)
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 3> [local count: 536870912]:
  # iftmp.0_6 = PHI <1(2)>
  _7 = __builtin_constant_p (iftmp.0_6);
  if (_7 != 0)
    goto <bb 6>; [50.00%]
  else
    goto <bb 8>; [50.00%]

  <bb 4> [local count: 536870912]:
  # iftmp.0_4 = PHI <-1(2)>
  _12 = __builtin_constant_p (iftmp.0_4);
  if (_12 != 0)
    goto <bb 5>; [50.00%]
  else
    goto <bb 8>; [50.00%]

  <bb 5> [local count: 268435456]:
  if (iftmp.0_4 >= -128)
    goto <bb 7>; [20.00%]
  else
    goto <bb 8>; [80.00%]

  <bb 6> [local count: 214748364]:
  if (iftmp.0_6 <= 127)
    goto <bb 7>; [12.00%]
  else
    goto <bb 8>; [88.00%]

  <bb 7> [local count: 91268056]:
  # iftmp.0_13 = PHI <iftmp.0_6(6), iftmp.0_4(5)>
  __asm__ __volatile__("asi %0,%1
" : "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "i" iftmp.0_13, "Q" MEM[(int 
*)&num_active_cpus] : "memory", "cc");
  goto <bb 9>; [100.00%]

  <bb 8> [local count: 982473769]:
  # iftmp.0_14 = PHI <iftmp.0_4(5), iftmp.0_4(4), iftmp.0_6(6), iftmp.0_6(3)>
  __asm__ __volatile__("laa %0,%2,%1
" : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "d" iftmp.0_14, "Q" MEM[(int 
*)&num_active_cpus] : "memory", "cc");

  <bb 9> [local count: 1073741824]:
  return;

}

Thread1 decides to separate the is_active and !is_active paths (surprisingly, on one path it optimizes out the <= 127 comparison and on the other the >= -128 comparison, while it could have optimized both comparisons on both paths). And it decides to make the paths converge after the comparisons, but before the asm.

What the pass did does hurt here. It looks like if we duplicate b_c_p, we may need to duplicate far enough to include all the blocks dominated by b_c_p==true (including the asm, here). Otherwise, any b_c_p can be optimized to true, because for a boolean b,

  b is the same as b ? true : false

and __builtin_constant_p(b ? true : false) would be the same as b ? __builtin_constant_p(true) : __builtin_constant_p(false), i.e. true.
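
A minimal, hypothetical illustration of that identity (not the testcase; whether the folding actually happens is the optimizer's choice, since b_c_p may legitimately return either value for a non-literal argument):

/* Before threading: v is a PHI of two constants, not a constant,
   so __builtin_constant_p typically folds to 0. */
int merged(int b)
{
    int v = b ? 1 : -1;
    return __builtin_constant_p(v);
}

/* After threading, each path sees a literal, so each copy of
   __builtin_constant_p folds to 1.  That is only safe if every block
   dominated by the "true" result -- in the testcase, the asm with the
   "i" constraint -- was duplicated along with it. */
int threaded(int b)
{
    if (b)
        return __builtin_constant_p(1);
    return __builtin_constant_p(-1);
}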

It is too bad we don't have any optimization pass using ranges between IPA and thread1; that would have gotten rid of the comparisons, and hence of the temptation to thread. Adding always_inline on atomic_add (or flatten on the caller) does help, as sketched below: EVRP then removes the comparisons.
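
Relative to the sketch above (again an assumption about the real helper's shape), the workaround is just:

-static inline void atomic_add(int val, int *ptr)
+static inline __attribute__((always_inline))
+void atomic_add(int val, int *ptr)

With the helper forcibly inlined early, EVRP sees the constant argument and deletes the range checks before thread1 ever runs.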

Do you see a way forward without changing what thread1 does or declaring the testcase as unsupported?

--
Marc Glisse
