On 16/11/15 13:45, Richard Biener wrote:
+             NEXT_PASS (pass_scev_cprop);
> >
> >What's that for?  It's supposed to help removing loops - I don't
> >expect kernels to vanish.
>
>I'm using pass_scev_cprop for the "final value replacement" functionality.
>Added comment.

That functionality is intented to enable loop removal.

Let me try to explain in a bit more detail.


I.

Consider a parloops testcase test.c, with a use of the final value of the iteration variable (return i):
...
unsigned int
foo (int n, int *a)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return i;
}
...

Say we compile with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...

We can see here in the parloops dump-file that the loop was parallelized:
...
  SUCCESS: may be parallelized
...

Now say that we run with -fno-tree-scev-cprop in addition. Instead we find in the parloops dump-file:
...
phi is i_1 = PHI <i_10(4)>
arg of phi to exit:   value i_10 used outside loop
  checking if it a part of reduction pattern:
  FAILED: it is not a part of reduction.
...

Auto-parallelization fails in this case because there is a loop exit phi (the one in bb 6 defining i_1) which is not part of a reduction:
...
  <bb 4>:
  # i_13 = PHI <0(3), i_10(5)>
  _5 = (long unsigned int) i_13;
  _6 = _5 * 4;
  _8 = a_7(D) + _6;
  *_8 = 1;
  i_10 = i_13 + 1;
  if (n_4(D) > i_10)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  goto <bb 4>;

  <bb 6>:
  # i_1 = PHI <i_10(4)>
  _20 = (unsigned int) i_1;
...

With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
...
final value replacement:
  i_1 = PHI <i_10(4)>
  with
  i_1 = n_4(D);
...

And the resulting loop no longer has any loop exit phis, so auto-parallelization succeeds:
...
  <bb 4>:
  # i_13 = PHI <0(3), i_10(5)>
  _5 = (long unsigned int) i_13;
  _6 = _5 * 4;
  _8 = a_7(D) + _6;
  *_8 = 1;
  i_10 = i_13 + 1;
  if (n_4(D) > i_10)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  goto <bb 4>;

  <bb 6>:
  _20 = (unsigned int) n_4(D);
...

[ I've filed PR68373 - "autopar fails on loop exit phi with argument defined outside loop", for a slightly different testcase where despite the final value replacement autopar still fails. ]


II.

Now, back to oacc kernels.

Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
...
module test
contains
  subroutine foo(n)
    implicit none
    integer :: n
    integer, dimension (0:n-1) :: a, b, c
    integer                    :: i, ii
    do i = 0, n - 1
       a(i) = i * 2
    end do

    do i = 0, n -1
       b(i) = i * 4
    end do

    !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
    do ii = 0, n - 1
       c(ii) = a(ii) + b(ii)
    end do
    !$acc end kernels

    do i = 0, n - 1
       if (c(i) .ne. a(i) + b(i)) call abort
    end do

  end subroutine foo
end module test
...

The loop at the start of the kernels pass group contains an in-memory iteration variable, with a store to '*_9 = _38'.
...
  <bb 4>:
  _13 = *.omp_data_i_4(D).c;
  c.21_14 = *_13;
  _16 = *_9;
  _17 = (integer(kind=8)) _16;
  _18 = *.omp_data_i_4(D).a;
  a.22_19 = *_18;
  _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
  _24 = *.omp_data_i_4(D).b;
  b.23_25 = *_24;
  _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
  _30 = _23 + _29;
  MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
  _38 = _16 + 1;
  *_9 = _38;
  if (_8 == _16)
    goto <bb 3>;
  else
    goto <bb 4>;
...

After pass_lim/pass_copy_prop, we've rewritten that into using a local iteration variable, but we've generated a read of the final value of the iteration variable outside the loop, which means auto-parallelization will fail:
...
  <bb 5>:
  # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
  _17 = (integer(kind=8)) D__lsm.29_12;
  _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
  _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
  _30 = _23 + _29;
  MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
  _38 = D__lsm.29_12 + 1;
  if (_8 == D__lsm.29_12)
    goto <bb 6>;
  else
    goto <bb 7>;

  <bb 6>:
  # D__lsm.29_27 = PHI <_38(5)>
  *_9 = D__lsm.29_27;
  goto <bb 3>;

  <bb 7>:
  goto <bb 5>;
...

This makes it similar to the parloops example above, and that's why I've added pass_scev_cprop in the kernels pass group.

[ And for some kernels test-cases with constant loop bound, it's not the final value replacement bit that does the substitution, but the first bit in scev_const_prop using resolve_mixers. So that's a related reason to use pass_scev_cprop. ]

Thanks,
- Tom

Reply via email to