Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

Richard Biener Tue, 17 Nov 2015 07:31:49 -0800

On Tue, 17 Nov 2015, Tom de Vries wrote:

> On 17/11/15 11:05, Richard Biener wrote:
> > On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <tom_devr...@mentor.com>
> > wrote:
> > > On 16/11/15 13:45, Richard Biener wrote:
> > > > > > 
> > > > > > +             NEXT_PASS (pass_scev_cprop);
> > > > > > > > 
> > > > > > > > What's that for?  It's supposed to help removing loops - I don't
> > > > > > > > expect kernels to vanish.
> > > > > 
> > > > > > 
> > > > > > I'm using pass_scev_cprop for the "final value replacement"
> > > > > > functionality.
> > > > > > Added comment.
> > > 
> > > 
> > > > That functionality is intented to enable loop removal.
> > > 
> > > 
> > > Let me try to explain in a bit more detail.
> > > 
> > > 
> > > I.
> > > 
> > > Consider a parloops testcase test.c, with a use of the final value of the
> > > iteration variable (return i):
> > > ...
> > > unsigned int
> > > foo (int n, int *a)
> > > {
> > >    int i;
> > >    for (i = 0; i < n; ++i)
> > >      a[i] = 1;
> > > 
> > >    return i;
> > > }
> > > ...
> > > 
> > > Say we compile with:
> > > ...
> > > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> > > ...
> > > 
> > > We can see here in the parloops dump-file that the loop was parallelized:
> > > ...
> > >    SUCCESS: may be parallelized
> > > ...
> > > 
> > > Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
> > > in the parloops dump-file:
> > > ...
> > > phi is i_1 = PHI <i_10(4)>
> > > arg of phi to exit:   value i_10 used outside loop
> > >    checking if it a part of reduction pattern:
> > >    FAILED: it is not a part of reduction.
> > > ...
> > > 
> > > Auto-parallelization fails in this case because there is a loop exit phi
> > > (the one in bb 6 defining i_1) which is not part of a reduction:
> > > ...
> > >    <bb 4>:
> > >    # i_13 = PHI <0(3), i_10(5)>
> > >    _5 = (long unsigned int) i_13;
> > >    _6 = _5 * 4;
> > >    _8 = a_7(D) + _6;
> > >    *_8 = 1;
> > >    i_10 = i_13 + 1;
> > >    if (n_4(D) > i_10)
> > >      goto <bb 5>;
> > >    else
> > >      goto <bb 6>;
> > > 
> > >    <bb 5>:
> > >    goto <bb 4>;
> > > 
> > >    <bb 6>:
> > >    # i_1 = PHI <i_10(4)>
> > >    _20 = (unsigned int) i_1;
> > > ...
> > > 
> > > With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
> > > ...
> > > final value replacement:
> > >    i_1 = PHI <i_10(4)>
> > >    with
> > >    i_1 = n_4(D);
> > > ...
> > > 
> > > And the resulting loop no longer has any loop exit phis, so
> > > auto-parallelization succeeds:
> > > ...
> > >    <bb 4>:
> > >    # i_13 = PHI <0(3), i_10(5)>
> > >    _5 = (long unsigned int) i_13;
> > >    _6 = _5 * 4;
> > >    _8 = a_7(D) + _6;
> > >    *_8 = 1;
> > >    i_10 = i_13 + 1;
> > >    if (n_4(D) > i_10)
> > >      goto <bb 5>;
> > >    else
> > >      goto <bb 6>;
> > > 
> > >    <bb 5>:
> > >    goto <bb 4>;
> > > 
> > >    <bb 6>:
> > >    _20 = (unsigned int) n_4(D);
> > > ...
> > > 
> > > [ I've filed PR68373 - "autopar fails on loop exit phi with argument
> > > defined
> > > outside loop", for a slightly different testcase where despite the final
> > > value replacement autopar still fails. ]
> > > 
> > > 
> > > II.
> > > 
> > > Now, back to oacc kernels.
> > > 
> > > Consider test-case kernels-loop-n.f95 (will add this one to the
> > > test-cases):
> > > ...
> > > module test
> > > contains
> > >    subroutine foo(n)
> > >      implicit none
> > >      integer :: n
> > >      integer, dimension (0:n-1) :: a, b, c
> > >      integer                    :: i, ii
> > >      do i = 0, n - 1
> > >         a(i) = i * 2
> > >      end do
> > > 
> > >      do i = 0, n -1
> > >         b(i) = i * 4
> > >      end do
> > > 
> > >      !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
> > >      do ii = 0, n - 1
> > >         c(ii) = a(ii) + b(ii)
> > >      end do
> > >      !$acc end kernels
> > > 
> > >      do i = 0, n - 1
> > >         if (c(i) .ne. a(i) + b(i)) call abort
> > >      end do
> > > 
> > >    end subroutine foo
> > > end module test
> > > ...
> > > 
> > > The loop at the start of the kernels pass group contains an in-memory
> > > iteration variable, with a store to '*_9 = _38'.
> > > ...
> > >    <bb 4>:
> > >    _13 = *.omp_data_i_4(D).c;
> > >    c.21_14 = *_13;
> > >    _16 = *_9;
> > >    _17 = (integer(kind=8)) _16;
> > >    _18 = *.omp_data_i_4(D).a;
> > >    a.22_19 = *_18;
> > >    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > >    _24 = *.omp_data_i_4(D).b;
> > >    b.23_25 = *_24;
> > >    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > >    _30 = _23 + _29;
> > >    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > >    _38 = _16 + 1;
> > >    *_9 = _38;
> > >    if (_8 == _16)
> > >      goto <bb 3>;
> > >    else
> > >      goto <bb 4>;
> > > ...
> > > 
> > > After pass_lim/pass_copy_prop, we've rewritten that into using a local
> > > iteration variable, but we've generated a read of the final value of the
> > > iteration variable outside the loop, which means auto-parallelization will
> > > fail:
> > > ...
> > >    <bb 5>:
> > >    # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
> > >    _17 = (integer(kind=8)) D__lsm.29_12;
> > >    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > >    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > >    _30 = _23 + _29;
> > >    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > >    _38 = D__lsm.29_12 + 1;
> > >    if (_8 == D__lsm.29_12)
> > >      goto <bb 6>;
> > >    else
> > >      goto <bb 7>;
> > > 
> > >    <bb 6>:
> > >    # D__lsm.29_27 = PHI <_38(5)>
> > >    *_9 = D__lsm.29_27;
> > >    goto <bb 3>;
> > 
> > So this store is not actually necessary?
> 
> a.
> In the case of this example, the store is dead.
> 
> There is a corresponding load at the point that we split off the region:
> ...
>   <bb 9>:
>   #pragma omp return
> 
>   <bb 10>:
>   D.3635 = .omp_data_arr.25.ii;
>   ii = *D.3635;
> ...
> 
> This load is later removed, given that ii is unused after the region. But once
> the region is split off,  there's nothing in the context of the store to
> suggest that it's dead.
> 
> And to get rid of the load of ii before the region is split off, we would have
> to implement some sort of liveness analysis on pre-ssa code.
> 
> b.
> There's the case where there is an explicit use of ii after the region, in
> which case the store is not dead.
> 
> c.
> And there's the case were we use a data clause on the region, f.i. 'create
> (ii)' to indicate that the variable is neither copied in nor copied out of the
> region (the default for a scalar in a kernels region is 'copy', meaning
> copy-in-and-out).
> 
> [ This means the value of ii after the region is uninitialized. So even if
> there's a read from ii after the region, we cannot consider it connected to
> the store, given that the value written by the store on the accelerator will
> not be copied back to the host. ]
> 
> In this case, we already don't have any load of ii after the region:
> ...
>   <bb 9>:
>   #pragma omp return
> 
>   <bb 10>:
>   .omp_data_sizes.28 = {CLOBBER};
>   .omp_data_arr.27 = {CLOBBER};
> ...
> 
> We could insert clobbers for the bits of .omp_data_arr at the end of the
> region to indicate that those are not used. That might enable dse to get rid
> of the dead store.
> 
> 
> But, I think we want a generic solution that handles cases a, b and c, which
> means we have to solve the most difficult case, which is b, where the store is
> not dead.
> 
> >  Or just in an inconvenient place?
> 
> I don't think the place of the store is inconvenient, it would be worse to
> have the store in the loop.
> 
> What is inconvenient about the store is the fact that it reads the final value
> of the iteration variable (which inhibits parloops).
> 
> > >    <bb 7>:
> > >    goto <bb 5>;
> > > ...
> > > 
> > > This makes it similar to the parloops example above, and that's why I've
> > > added pass_scev_cprop in the kernels pass group.
> > > 
> > > [ And for some kernels test-cases with constant loop bound, it's not the
> > > final value replacement bit that does the substitution, but the first bit
> > > in
> > > scev_const_prop using resolve_mixers. So that's a related reason to use
> > > pass_scev_cprop. ]
> > 
> > IMHO autopar needs to handle induction itself.
> 
> I'm not sure what you mean. Could you elaborate?  Autopar handles induction
> variables, but it doesn't handle exit phis reading the final value of the
> induction variable. Is that what you want fixed? How?


Yes.  Perform final value replacement.

> > And the above LIM example
> > is none for why you need two LIM passes...
> 
> Indeed. I'm planning a separate reply to explain in more detail the need for
> the two pass_lims.

Thanks.

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

Reply via email to