On Sat, 2020-06-27 at 01:03 +0200, Ilya Leoshkevich wrote:
> On Fri, 2020-06-26 at 16:04 -0600, Jeff Law wrote:
> > On Fri, 2020-06-26 at 23:54 +0200, Ilya Leoshkevich wrote:
> > > How should this work ideally?  Suppose we have:
> > > 
> > > static inline void foo (int i)
> > > {
> > >   if (__builtin_constant_p (i))
> > >     asm volatile ("bar" :: "i" (i));
> > >   else
> > >     asm volatile ("baz" :: "d" (i));
> > > }
> > > 
> > > First of all, this is a supported use case, right?
> > Yes, this is a supported case.
> > 
> > > Then, the way I see it, at least for a while there must exist trees
> > > like the ones above, regardless of whether their asm arguments match
> > > constraints.  But ultimately dead code elimination should get rid of
> > > the wrong ones before they reach RTL.
> > > E.g. in the example above, the non-inlined version should have
> > > `!__builtin_constant_p (i)`, so `bar` should not survive until
> > > RTL.  The same for inlined `foo (1)` version's `baz`.
> > In theory yes, but there are cases where paths converge (like you've
> > shown) where you may have evaluated to a constant on the paths, but
> > it's not a constant at the convergence point.  You have to be very
> > careful using b_c_p like this and it's been a regular source of
> > kernel bugs.
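
To make the convergence issue concrete: given an inline helper like the foo
above, a caller of this shape is exactly where things go wrong (the caller is
my own illustration, but it matches the testcase in the dumps below):

void caller (int flag)
{
  /* Each arm passes a literal constant, but after inlining the paths
     converge and foo's argument is only a PHI of 1 and -1, which is not
     a constant as far as the "i" constraint is concerned.  */
  foo (flag ? 1 : -1);
}
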
> 
> Is there something specific that a compiler user should look out for?
> For example, here is the kernel code, from which the test was derived:
> 
> static inline void atomic_add(int i, atomic_t *v)
> {
> #ifdef CONFIG_HAVE_MARCH_Z196_FEATURES
>         if (__builtin_constant_p(i) && (i > -129) && (i < 128)) {
>                 __atomic_add_const(i, &v->counter);
>                 return;
>         }
> #endif
>         __atomic_add(i, &v->counter);
> }
>       
> It looks very straightforward - can there still be something wrong
> with its usage of b_c_p?
> 
> > I'd recommend looking at the .ssa dump and walking forward from there
> > if the .ssa dump looks correct.
> > 
> > jeff
> 
> Well, 021t.ssa already has:
> 
> __attribute__((gnu_inline))
> __atomic_add_const (intD.6 valD.2269, intD.6 * ptrD.2270)
> {
>   intD.6 val_3(D) = valD.2269;
>   intD.6 * ptr_2(D) = ptrD.2270;
> ;;   basic block 2, loop depth 0, maybe hot
> ;;    prev block 0, next block 1, flags: (NEW)
> ;;    pred:       ENTRY (FALLTHRU)
>   # .MEM_4 = VDEF <.MEM_1(D)>
>   __asm__ __volatile__("asi %0,%1
> " : "ptr" "=Q" *ptr_2(D) : "val" "i" val_3(D), "Q" *ptr_2(D) :
> "memory", "cc");
>   # VUSE <.MEM_4>
>   return;
> ;;    succ:       EXIT
> 
> }
> 
> which is, strictly speaking, not correct, because val_3(D) and
> valD.2269 are not constant.  But as far as I understand we are willing
> to tolerate trees like this until a certain point.
> 
> What is this point supposed to be?  If I understood you right, 
> 106t.thread1 is already too late - why is it so?
Well, you need to show more context to know if it's strictly OK.

Here's what I get with a cross compiler in the .ssa dump:

;;   basic block 2, loop depth 0, maybe hot
;;    prev block 0, next block 3, flags: (NEW)
;;    pred:       ENTRY (FALLTHRU)
  _1 = __builtin_constant_p (i_5(D));
  if (_1 != 0)
    goto <bb 3>; [INV]
  else
    goto <bb 6>; [INV]
;;    succ:       3 (TRUE_VALUE)
;;                6 (FALSE_VALUE)

;;   basic block 3, loop depth 0, maybe hot
;;    prev block 2, next block 4, flags: (NEW)
;;    pred:       2 (TRUE_VALUE)
  if (i_5(D) >= -128)
    goto <bb 4>; [INV]
  else
    goto <bb 6>; [INV]
;;    succ:       4 (TRUE_VALUE)
;;                6 (FALSE_VALUE)

;;   basic block 4, loop depth 0, maybe hot
;;    prev block 3, next block 5, flags: (NEW)
;;    pred:       3 (TRUE_VALUE)
  if (i_5(D) <= 127)
    goto <bb 5>; [INV]
  else
    goto <bb 6>; [INV]
;;    succ:       5 (TRUE_VALUE)
;;                6 (FALSE_VALUE)

;;   basic block 5, loop depth 0, maybe hot
;;    prev block 4, next block 6, flags: (NEW)
;;    pred:       4 (TRUE_VALUE)
  _2 = &v_6(D)->counter;
  __atomic_add_const (i_5(D), _2);
  // predicted unlikely by early return (on trees) predictor.
  goto <bb 7>; [INV]
;;    succ:       7 (FALLTHRU)

And that's OK.  In particular we have the b_c_p call on i_5, and the only path
to the call to __atomic_add_const is predicated on the result of that b_c_p
call.  Furthermore, we use i_5 in the call to __atomic_add_const.

Contrast that to the .thread1 dump:


;;   basic block 3, loop depth 0
;;    pred:       2
  # iftmp.0_6 = PHI <1(2)>
  _7 = __builtin_constant_p (iftmp.0_6);
  if (_7 != 0)
    goto <bb 6>; [50.00%]
  else
    goto <bb 8>; [50.00%]
;;    succ:       6
;;                8

;;   basic block 4, loop depth 0
;;    pred:       2
  # iftmp.0_4 = PHI <-1(2)>
  _12 = __builtin_constant_p (iftmp.0_4);
  if (_12 != 0)
    goto <bb 5>; [50.00%]
  else
    goto <bb 8>; [50.00%]
;;    succ:       5
;;                8

;;   basic block 5, loop depth 0
;;    pred:       4
  if (iftmp.0_4 >= -128)
    goto <bb 7>; [20.00%]
  else
    goto <bb 8>; [80.00%]
;;    succ:       7
;;                8

;;   basic block 6, loop depth 0
;;    pred:       3
  if (iftmp.0_6 <= 127)
    goto <bb 7>; [12.00%]
  else
    goto <bb 8>; [88.00%]
;;    succ:       7
;;                8

;;   basic block 7, loop depth 0
;;    pred:       6
;;                5
  # iftmp.0_13 = PHI <iftmp.0_6(6), iftmp.0_4(5)>
  __asm__ __volatile__("asi %0,%1
" : "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "i" iftmp.0_13, "Q" MEM[(int
*)&num_active_cpus] : "memory", "cc");
  goto <bb 9>; [100.00%]

In the (now inlined) call to __atomic_add_const we use iftmp.0_13, which is
set from a PHI node, and we never test iftmp.0_13 via b_c_p.

I think that's the key difference.
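
Rendered back into pseudo-source, the threaded code has roughly this shape
(purely illustrative; as written the "i" constraint can't be honored, which is
precisely the problem):

static void threaded_shape (int flag, int *counter)
{
  int tmp, is_const;

  if (flag)
    {
      tmp = 1;
      is_const = __builtin_constant_p (1);   /* 1 on this path */
    }
  else
    {
      tmp = -1;
      is_const = __builtin_constant_p (-1);  /* 1 on this path */
    }

  /* The -129 < i < 128 range checks get duplicated per path as well;
     they're omitted here for brevity.  */
  if (is_const)
    /* tmp is a PHI of 1 and -1 at this point, so "i" (tmp) in the
       inlined asm cannot be satisfied.  */
    asm volatile ("asi %0,%1"
                  : "=Q" (*counter)
                  : "i" (tmp), "Q" (*counter)
                  : "memory", "cc");
  else
    *counter += tmp;
}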

Having said that, and having a cross compiler and a full set of dumps, I can
see that things are fine before .thread1.

Here are the relevant parts of the .mergephi2 dump:

;;   basic block 4, loop depth 0
;;    pred:       2
;;                3
  # iftmp.0_1 = PHI <1(2), -1(3)>
  _5 = __builtin_constant_p (iftmp.0_1);
  if (_5 != 0)
    goto <bb 5>; [50.00%]
  else
    goto <bb 8>; [50.00%]
;;    succ:       5
;;                8

;;   basic block 5, loop depth 0
;;    pred:       4
  if (iftmp.0_1 >= -128)
    goto <bb 6>; [50.00%]
  else
    goto <bb 8>; [50.00%]
;;    succ:       6
;;                8

;;   basic block 6, loop depth 0
;;    pred:       5
  if (iftmp.0_1 <= 127)
    goto <bb 7>; [34.00%]
  else
    goto <bb 8>; [66.00%]
;;    succ:       7
;;                8

;;   basic block 7, loop depth 0
;;    pred:       6
  __asm__ __volatile__("asi %0,%1
" : "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "i" iftmp.0_1, "Q" MEM[(int
*)&num_active_cpus] : "memory", "cc");
  goto <bb 9>; [100.00%]
;;    succ:       9

So we use iftmp.0_1 in bb7, and the path to bb7 is predicated on
b_c_p (iftmp.0_1), so we're fine.

So, I agree that backwards jump threading mucked this up.  The question becomes
what can/should we do about it.


Disabling the transformation in the backwards jump threader as you've suggested
just papers over the problem.  I could easily see other passes making similar
transformations.  And there's this nugget in tree-ssa-dom.c:


         /* Resolve __builtin_constant_p.  If it hasn't been
            folded to integer_one_node by now, it's fairly
            certain that the value simply isn't constant.  */

Which hints that maybe the right thing to do is just resolve these calls earlier
before things start mucking around too much with the CFG.
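
Something along these lines, perhaps (just a sketch, not an actual pass; the
function name is made up and the placement and typing details are glossed
over):

static void
resolve_remaining_b_c_p (function *fun)
{
  basic_block bb;

  FOR_EACH_BB_FN (bb, fun)
    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
         gsi_next (&gsi))
      {
        gimple *stmt = gsi_stmt (gsi);
        if (!gimple_call_builtin_p (stmt, BUILT_IN_CONSTANT_P))
          continue;

        tree lhs = gimple_call_lhs (stmt);
        if (!lhs)
          continue;

        /* Fold to 1 only if the argument is already invariant; otherwise
           give up and fold to 0, so that later CFG surgery can no longer
           "discover" constancy along duplicated paths.  */
        tree arg = gimple_call_arg (stmt, 0);
        tree val = is_gimple_min_invariant (arg)
                   ? integer_one_node : integer_zero_node;

        gsi_replace (&gsi, gimple_build_assign (lhs, val), true);
      }
}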

The other alternative would be to reject transformations which register an
SSA_NAME for incremental updating when that SSA_NAME is used in a b_c_p call. 
The jump threaders (forward and backward) are probably the worst offenders here, but
there could well be others.

We should probably seriously consider verifying that names registered for
updates aren't referenced by a b_c_p call.  I wouldn't be surprised if adding
such a check would turn up all kinds of violations.
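
For the record, the check itself ought to be cheap.  A sketch using the
immediate-use iterator (the function name is made up; only the iteration
idiom is existing API):

static bool
ssa_name_feeds_b_c_p (tree name)
{
  use_operand_p use_p;
  imm_use_iterator iter;

  FOR_EACH_IMM_USE_FAST (use_p, iter, name)
    if (gimple_call_builtin_p (USE_STMT (use_p), BUILT_IN_CONSTANT_P))
      return true;

  return false;
}

The SSA updater could then assert !ssa_name_feeds_b_c_p for every name it's
asked to handle.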

Jeff
