On 24-11-14 13:12, Richard Biener wrote:
On Mon, 24 Nov 2014, Tom de Vries wrote:

On 24-11-14 12:28, Tom de Vries wrote:
On 17-11-14 11:13, Richard Biener wrote:
On Sat, 15 Nov 2014, Tom de Vries wrote:

On 15-11-14 13:14, Tom de Vries wrote:
Hi,

I'm submitting a patch series with initial support for the oacc
kernels
directive.

The patch series uses pass_parallelize_loops to implement
parallelization of
loops in the oacc kernels region.

The patch series consists of these 8 patches:
...
      1  Expand oacc kernels after pass_build_ealias
      2  Add pass_oacc_kernels
      3  Add pass_ch_oacc_kernels to pass_oacc_kernels
      4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
      5  Add pass_loop_im to pass_oacc_kernels
      6  Add pass_ccp to pass_oacc_kernels
      7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
      8  Do simple omp lowering for no address taken var
...

This patch lowers integer variables that do not have their address
taken as
local variable.  We use a copy at region entry and exit to copy the
value in
and out.

In the context of reduction handling in a kernels region, this allows
the
parloops reduction analysis to recognize the reduction, even after oacc
lowering has been done in pass_lower_omp.

In more detail, without this patch, the omp_data_i load and stores are
generated in place (in this case, in the loop):
...
                 {
                   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
                   {
                     unsigned intD.9 iD.2146;

                     iD.2146 = 0;
                     goto <D.2207>;
                     <D.2208>:
                     D.2216 = .omp_data_iD.2201->cD.2203;
                     c.9D.2176 = *D.2216;
                     D.2177 = (long unsigned intD.10) iD.2146;
                     D.2178 = D.2177 * 4;
                     D.2179 = c.9D.2176 + D.2178;
                     D.2180 = *D.2179;
                     D.2217 = .omp_data_iD.2201->sumD.2205;
                     D.2218 = *D.2217;
                     D.2217 = .omp_data_iD.2201->sumD.2205;
                     D.2219 = D.2180 + D.2218;
                     *D.2217 = D.2219;
                     iD.2146 = iD.2146 + 1;
                     <D.2207>:
                     if (iD.2146 <= 524287) goto <D.2208>; else goto
<D.2209>;
                     <D.2209>:
                   }
...

With this patch, the omp_data_i load and stores for sum are generated
at entry
and exit:
...
                 {
                   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
                   D.2216 = .omp_data_iD.2201->sumD.2205;
                   sumD.2206 = *D.2216;
                   {
                     unsigned intD.9 iD.2146;

                     iD.2146 = 0;
                     goto <D.2207>;
                     <D.2208>:
                     D.2217 = .omp_data_iD.2201->cD.2203;
                     c.9D.2176 = *D.2217;
                     D.2177 = (long unsigned intD.10) iD.2146;
                     D.2178 = D.2177 * 4;
                     D.2179 = c.9D.2176 + D.2178;
                     D.2180 = *D.2179;
                     sumD.2206 = D.2180 + sumD.2206;
                     iD.2146 = iD.2146 + 1;
                     <D.2207>:
                     if (iD.2146 <= 524287) goto <D.2208>; else goto
<D.2209>;
                     <D.2209>:
                   }
                   *D.2216 = sumD.2206;
                   #pragma omp return
                 }
...


So, without the patch the reduction operation looks like this:
...
     *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205)
+ x
...

And with this patch the reduction operation is simply:
...
     sumD.2206 = sumD.2206 + x:
...

OK for trunk?
I presume the reason you are trying to do that here is that otherwise
it happens too late?  What you do is what loop store motion would
do.

Richard,

Thanks for the hint. I've built a reduction example:
...
void __attribute__((noinline))
f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned
int n)
{
    unsigned int i;
    for (i = 0; i < n; ++i)
      *sum += a[i];
}...
and observed that store motion of the *sum store is done by pass_loop_im,
provided the *sum load is taken out of the the loop by pass_pre first.

So alternatively, we could use pass_pre and pass_loop_im to achieve the same
effect.

When trying out adding pass_pre as a part of the pass group
pass_oacc_kernels, I
found that also pass_copyprop was required to get parloops to recognize the
reduction.


Attached patch adds pass_copyprop to pass group pass_oacc_kernels.

Hum, you are gobbling up very many passes here.  In this case copyprop
will also perform trivial constant propagation so maybe it's enough
to replace ccp by copyprop.  Or go the full way and add a FRE pass.


Yep, replacing ccp by copyprop seems to work well enough.

I'll repost once bootstrap and reg-test are done.

Thanks,
- Tom

Reply via email to