Hi,

(Chung-Lin, question for you buried below.)
On Thu, 15 Apr 2021 19:26:54 +0200
Thomas Schwinge <tho...@codesourcery.com> wrote:

> Hi!
>
> On 2021-02-26T04:34:50-0800, Julian Brown <jul...@codesourcery.com>
> wrote:
> > This patch
>
> Thanks, Julian, for your continued improving of these changes!

You're welcome!

> This has iterated through several conceptually different designs and
> implementations, by several people, over the past several years.

I hope this wasn't a hint that I'd failed to attribute the authorship
of the patch properly? Many apologies if so, that certainly wasn't my
intention!

> > implements a method to track the "private-ness" of
> > OpenACC variables declared in offload regions in gang-partitioned,
> > worker-partitioned or vector-partitioned modes. Variables declared
> > implicitly in scoped blocks and those declared "private" on
> > enclosing directives (e.g. "acc parallel") are both handled.
> > Variables that are e.g. gang-private can then be adjusted so they
> > reside in GPU shared memory.
> >
> > The reason for doing this is twofold: correct implementation of
> > OpenACC semantics
>
> ACK, and as mentioned before, this very much relates to
> <https://gcc.gnu.org/PR90115> "OpenACC: predetermined private levels
> for variables declared in blocks" (plus the corresponding use of
> 'private' clauses, implicit/explicit, including 'firstprivate') and
> <https://gcc.gnu.org/PR90114> "Predetermined private levels for
> variables declared in OpenACC accelerator routines", which we thus
> should refer in testcases/ChangeLog/commit log, as appropriate. I do
> understand we're not yet addressing all of that (and that's fine!),
> but we should capture remaining work items of the PRs and Cesar's
> list in
> <http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>),
> as appropriate. From that list:
>
> * Currently variables in private clauses inside acc loops will not
>   utilize shared memory.

The patch should handle this properly now.
> * OpenACC routines don't use shared memory, except for reductions and
>   worker state propagation.

Routines weren't a focus of this patch (at the point I inherited it),
and I did not attempt to extend it to cover routines either. TBH the
state there is a bit of an unknown (but the patch won't make the
situation any worse).

> * Variables local to worker loops don't use shared memory.

That's still true, and IIUC for that to work we'd need to expand
scalars into indexed array references (i.e. "var" ->
"var_arr[vector_lane]" or similar). It's not clear if/when/why we'd
want to do that.

As an aside, if we want to avoid shared memory for some reason but
want to maintain OpenACC semantics, we'd also have to do a similar
transformation for gang-private variables ("var" -> "var[gang_number]",
where the array is on the stack or in global memory, or similar). Then
for worker-private variables we'd need to do "var" ->
"var[gang_number * num_workers + worker_number]". We've avoided
needing to do that so far, but for some cases -- maybe large local
private arrays? -- it might be necessary at some point.

> * Variables local to automatically partitioned gang and worker loops
>   don't use shared memory.

Local variables in automatically-partitioned gang loops should work
fine now.

> * Shared memory is allocated globally, not locally on a per-function
>   basis. We're not sure if that matters though.

Arguably, that's down to the target, not this middle-end patch -- this
patch itself might not *help* do per-function allocation, but it
doesn't set a policy that allocation must be global either.

> I was surprised that we didn't really have to fix up any existing
> libgomp testcases, because there seem to be quite some that contain a
> pattern (exemplified by the 'tmp' variable) as follows:
>
>     int main()
>     {
>     #define N 123
>       int data[N];
>       int tmp;
>
>     #pragma acc parallel // implicit 'firstprivate(tmp)'
>       {
>         // 'tmp' now conceptually made gang-private here.
>     #pragma acc loop gang
>         for (int i = 0; i < 123; ++i)
>           {
>             tmp = i + 234;
>             data[i] = tmp;
>           }
>       }
>
>       for (int i = 0; i < 123; ++i)
>         if (data[i] != i + 234)
>           __builtin_abort ();
>
>       return 0;
>     }
>
> With the code changes as posted, this actually now does *not* use
> gang-private memory for 'tmp', but instead continues to use
> "thread-private registers", as before.

When "tmp" is a local, non-address-taken scalar like that, it'll
probably end up in a register in offloaded code (or of course be
compiled out completely), both before and after this patch. So I
wouldn't expect this not to work in the pre-patch state.

> Same for:
>
>     --- s3.c	2021-04-13 17:26:49.628739379 +0200
>     +++ s3_2.c	2021-04-13 17:29:43.484579664 +0200
>     @@ -4,6 +4,6 @@
>        int data[N];
>     -  int tmp;
>
>     -#pragma acc parallel // implicit 'firstprivate(tmp)'
>     +#pragma acc parallel
>        {
>     +    int tmp;
>          // 'tmp' now conceptually made gang-private here.
>      #pragma acc loop gang
>
> I suppose that's due to conditionalizing this transformation on
> 'TREE_ADDRESSABLE' (as you're doing), so we should be mostly "safe"
> regarding such existing testcases (but I haven't verified that yet in
> detail).

Right.

> That needs to be documented in testcases, with some kind of dump
> scanning (host compilation-side even; see below).
>
> A note for later: if this weren't just a 'gang' loop, but 'gang' plus
> 'worker' and/or 'vector', we'd actually be fixing up user code with
> undefined behavior into "correct" code (by *not* making 'tmp'
> gang-private, but thread-private), right?

Possibly -- coming up with a case like that might need a little
"ingenuity"...

> As that may not be obvious to the reader, I'd like to have the
> 'TREE_ADDRESSABLE' conditionalization be documented in the code. You
> had explained that in
> <http://mid.mail-archive.com/20190612204216.0ec83e4e@squid.athome>: "a
> non-addressable variable [...]".

Yeah, that probably makes sense.
> > and optimisation, since shared memory might be faster than
> > the main memory on a GPU.
>
> Do we potentially have a problem that making more use of (scarce)
> gang-private memory may negatively affect performance, because
> potentially fewer OpenACC gangs may then be launched to the GPU
> hardware in parallel? (Of course, OpenACC semantics conformance
> firstly is more important than performance, but there may be ways to
> be conformant and performant; "quality of implementation".) Have you
> run any such performance testing with the benchmarking codes that
> we've got set up?

I don't have any numbers for this patch, no. As for the question as to
whether there are constructs that are currently compiled in a
semantically-correct way but that this patch pessimises -- I'm not
aware of anything like that, but there might be.

> (As I'm more familiar with that, I'm using nvptx offloading examples
> in the following, whilst assuming that similar discussion may apply
> for GCN offloading, which uses similar hardware concepts, as far as I
> remember.)
>
> Looking at the existing
> 'libgomp.oacc-c-c++-common/private-variables.c' (random example), for
> nvptx offloading, '-O0', we see the following PTX JIT compilation
> changes (word-'diff' of 'GOMP_DEBUG=1' at run-time):
>
>     info    : Function properties for 'local_g_1$_omp_fn$0':
>     info    : used 27 registers, 32 stack, [-176-]{+256+} bytes smem, 328 bytes cmem[0], 0 bytes lmem
>     info    : Function properties for 'local_w_1$_omp_fn$0':
>     info    : used 40 registers, 48 stack, [-176-]{+256+} bytes smem, 328 bytes cmem[0], 0 bytes lmem
>     info    : Function properties for 'local_w_2$_omp_fn$0':
>     [...]
>     info    : Function properties for 'parallel_g_1$_omp_fn$0':
>     info    : used 27 registers, 32 stack, [-176-]{+256+} bytes smem, 328 bytes cmem[0], 0 bytes lmem
>     info    : Function properties for 'parallel_g_2$_omp_fn$0':
>     info    : used 32 registers, 160 stack, [-176-]{+256+} bytes smem, 328 bytes cmem[0], 0 bytes lmem
>
> ...
> that is, PTX '.shared' usage increases from 176 to 256 bytes for
> *all* functions, even though only 'loop_g_4$_omp_fn$0' and
> 'loop_g_5$_omp_fn$0' are actually using gang-private memory.
>
> Execution testing works before (original code, not using gang-private
> memory) as well as after (code changes as posted, using gang-private
> memory), so use of gang-private memory doesn't seem necessary here for
> "correct execution" -- or at least: "expected execution result". ;-)
> I haven't looked yet whether there's a potential issue in the
> testcases here.
>
> The additional '256 - 176 = 80' bytes of PTX '.shared' memory
> requested are due to the GCC nvptx back end implementation's use of a
> global "Shared memory block for gang-private variables":
>
>     // BEGIN VAR DEF: __oacc_bcast
>     .shared .align 8 .u8 __oacc_bcast[176];
>     +// BEGIN VAR DEF: __gangprivate_shared
>     +.shared .align 32 .u8 __gangprivate_shared[64];
>
> ..., plus (I suppose) an additional '80 - 64 = 16' padding/unused
> bytes to establish '.align 32' after '.align 8' for '__oacc_bcast'.
>
> Per
> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities>,
> "Table 15. Technical Specifications per Compute Capability", "Compute
> Capability": "3.5", we have a "Maximum amount of shared memory per
> SM": "48 KB", so with '176 bytes smem', that permits '48 * 1024 / 176
> = 279' thread blocks ('num_gangs') resident at one point in time,
> whereas with '256 bytes smem', it's just '48 * 1024 / 256 = 192'
> thread blocks resident at one point in time. (Not sure that I got
> all the details right, but you get the idea/concern?)
>
> Anyway, that shall be OK for now, but we shall later look into
> optimizing that; can't we have '.shared' local to the relevant PTX
> functions instead of global?

As mentioned in a previous posting (probably some time ago!), the
NVPTX back-end parts were a bit of the patch that I inherited from the
earliest versions and didn't alter much.
The possibility for function-local allocation has been raised before
(for NVPTX), but I haven't investigated if it's possible or
beneficial.

> Interestingly, compiling with '-O2', we see:
>
>     // BEGIN VAR DEF: __oacc_bcast
>     .shared .align 8 .u8 __oacc_bcast[144];
>     {+// BEGIN VAR DEF: __gangprivate_shared+}
>     {+.shared .align 128 .u8 __gangprivate_shared[32];+}
>
> With '-O2', only 'loop_g_5$_omp_fn$0' is using gang-private memory,
> and apparently the PTX JIT is able to figure that out from the PTX
> code that GCC generates, and is then able to localize '.shared'
> memory usage to just 'loop_g_5$_omp_fn$0':
>
>     [...]
>     info    : Function properties for 'loop_g_4$_omp_fn$0':
>     info    : used 12 registers, 0 stack, 144 bytes smem, 328 bytes cmem[0], 0 bytes lmem
>     info    : Function properties for 'loop_g_5$_omp_fn$0':
>     info    : used [-30-]{+32+} registers, 32 stack, [-144-]{+288+} bytes smem, 328 bytes cmem[0], 0 bytes lmem
>     info    : Function properties for 'loop_g_6$_omp_fn$0':
>     info    : used 13 registers, 0 stack, 144 bytes smem, 328 bytes cmem[0], 0 bytes lmem
>     [...]
>
> This strongly suggests to me that indeed there must exist a
> programmatic way to get rid of the global "Shared memory block for
> gang-private variables".
>
> The additional '288 - 144 = 144' bytes of PTX '.shared' memory
> requested are 32 bytes for 'int x[8]' ('#pragma acc loop gang
> private(x)') plus '288 - 32 - 144 = 112' padding/unused bytes to
> establish '.align 128' (!) after '.align 8' for '__oacc_bcast'.
> That's clearly not ideal: 112 bytes wasted in contrast to just '144 +
> 32 = 176' bytes actually used. (I have not yet looked why/whether
> this really needs '.align 128'.)

I'm sure improvements are possible there (maybe later?).

> I have not yet looked whether similar concerns exist for the GCC GCN
> back end implementation. (That one also does set 'TREE_STATIC' for
> gang-private memory, so it's a global allocation?)

Yes, or rather per-CU allocation.
> > Handling of private variables is intimately
> > tied to the execution model for gangs/workers/vectors implemented by
> > a particular target: for current targets, we use (or on mainline,
> > will soon use) a broadcasting/neutering scheme.
> >
> > That is sufficient for code that e.g. sets a variable in
> > worker-single mode and expects to use the value in
> > worker-partitioned mode. The difficulty (semantics-wise) comes when
> > the user wants to do something like an atomic operation in
> > worker-partitioned mode and expects a worker-single (gang private)
> > variable to be shared across each partitioned worker. Forcing use
> > of shared memory for such variables makes that work properly.
>
> Are we reliably making sure that gang-private variables (and other
> levels, in general) are not subject to the usual broadcasting scheme
> (nvptx, at least), or does that currently work "by accident"? (I
> haven't looked into that, yet.)

Yes, that case is explicitly handled by the broadcasting/neutering
patch recently posted. (One of the reasons that patch depends on this
one.)

> > In terms of implementation, the parallelism level of a given loop is
> > not fixed until the oaccdevlow pass in the offload compiler, so the
> > patch delays fixing the parallelism level of variables declared on
> > or within such loops until the same point. This is done by adding a
> > new internal UNIQUE function (OACC_PRIVATE) that lists (the address
> > of) each private variable as an argument, and other arguments set
> > so as to be able to determine the correct parallelism level to use
> > for the listed variables. This new internal function fits into the
> > existing scheme for demarcating OpenACC loops, as described in
> > comments in the patch.
>
> Yes, thanks, that's conceptually now much better than the earlier
> variants that we had. :-) (Hooray, again, for Nathan's OpenACC
> execution model design!)
>
> What we should add, though, is a bunch of testcases to verify that the
> expected processing does/doesn't happen for relevant source code
> constructs. I'm thinking that when the transformation is/isn't done,
> that gets logged, and we can then scan the dumps accordingly. Some of
> that is implemented already; we should be able to do such scanning
> generally for host compilation, too, not just offloading compilation.

More test coverage is always welcome, of course.

> > Two new target hooks are introduced:
> > TARGET_GOACC_ADJUST_PRIVATE_DECL and TARGET_GOACC_EXPAND_VAR_DECL.
> > The first can tweak a variable declaration at oaccdevlow time, and
> > the second at expand time. The first or both of these target hooks
> > can be used by a given offload target, depending on its strategy
> > for implementing private variables.
>
> ACK.
>
> So, currently we're only looking at making the gang-private level
> work. Regarding that, we have two configurations: (1) for GCN
> offloading, 'targetm.goacc.adjust_private_decl' does the work (in
> particular, change 'TREE_TYPE' etc.) and there is no
> 'targetm.goacc.expand_var_decl', and (2) for nvptx offloading,
> 'targetm.goacc.adjust_private_decl' only sets a marker ('oacc
> gangprivate' attribute) and then 'targetm.goacc.expand_var_decl' does
> the work.
>
> Therefore I suggest we clarify the (currently) expected handling
> similar to:
>
>     --- gcc/omp-offload.c
>     +++ gcc/omp-offload.c
>     @@ -1854,6 +1854,19 @@ oacc_rewrite_var_decl (tree *tp, int *walk_subtrees, void *data)
>        return NULL_TREE;
>      }
>
>     +static tree
>     +oacc_rewrite_var_decl_ (tree *tp, int *walk_subtrees, void *data)
>     +{
>     +  tree t = oacc_rewrite_var_decl (tp, walk_subtrees, data);
>     +  if (targetm.goacc.expand_var_decl)
>     +    {
>     +      walk_stmt_info *wi = (walk_stmt_info *) data;
>     +      var_decl_rewrite_info *info = (var_decl_rewrite_info *) wi->info;
>     +      gcc_assert (!info->modified);
>     +    }
>     +  return t;
>     +}

Why the ugly '_' tail on the function name!?
I don't think that's a typical GNU coding standards thing, is it?

>     +
>      /* Return TRUE if CALL is a call to a builtin atomic/sync
>         operation.  */
>      static bool
>     @@ -2195,6 +2208,9 @@ execute_oacc_device_lower ()
>           COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also rewritten
>           to use the new decl, adjusting types of appropriate tree nodes as
>           necessary.  */
>     +  if (targetm.goacc.expand_var_decl)
>     +    gcc_assert (adjusted_vars.is_empty ());

If you like -- or do something like

  if (targetm.goacc.adjust_private_decl && !adjusted_vars.is_empty ())

perhaps.

>      {
>        FOR_ALL_BB_FN (bb, cfun)
>     @@ -2217,7 +2233,7 @@ execute_oacc_device_lower ()
>            memset (&wi, 0, sizeof (wi));
>            wi.info = &info;
>
>     -      walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi);
>     +      walk_gimple_op (stmt, oacc_rewrite_var_decl_, &wi);
>
>            if (info.modified)
>              update_stmt (stmt);
>
> Or, in fact, 'if (targetm.goacc.expand_var_decl)', skip the
> 'adjusted_vars' handling completely?

For the current pair of implementations, sure. I don't think it's
necessary to set that as a constraint for future targets, though? I
guess it doesn't matter much until such a target exists.

> I do understand that eventually (in particular, for worker-private
> level?), both 'targetm.goacc.adjust_private_decl' and
> 'targetm.goacc.expand_var_decl' may need to do things, but that's
> currently not meant to be addressed, and thus not fully worked out and
> implemented, and thus untested. Hence, 'assert' what currently is
> implemented/tested, only.

If you like; no strong feelings from me on that.

> (Given that eventual goal, that's probably sufficient motivation to
> indeed add the 'adjusted_vars' handling in generic 'gcc/omp-offload.c'
> instead of moving it into the GCN back end?)

I'm not sure what moving it to the GCN back end would look like. I
guess it's a question of keeping the right abstractions in the right
place.
> For 'libgomp.oacc-c-c++-common/static-variable-1.c' that I've recently
> added, the code changes here cause execution test FAILs for nvptx
> offloading (because of making 'static' variables gang-private), and
> trigger an ICE with GCN offloading compilation. It isn't clear to me
> what the desired semantics are for (user-specified) 'static'
> variables -- see <https://github.com/OpenACC/openacc-spec/issues/372>
> "C/C++ 'static' variables" (only visible to members of the GitHub
> OpenACC organization) -- but an ICE clearly isn't the right answer.
> ;-)
>
> As for certain transformations/optimizations, 'static' variables may
> be synthesized in the GCC middle end, I suppose we should preserve the
> status quo (as documented via
> 'libgomp.oacc-c-c++-common/static-variable-1.c') until #372 gets
> resolved in OpenACC? (I suppose, skip the transformation if
> 'TREE_STATIC' is set, or similar.)

ICEs are bad -- but a user expecting static variables to do something
meaningful in offloaded code is being somewhat optimistic, I think!

> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -10224,8 +10224,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
> >        exp = SSA_NAME_VAR (ssa_name);
> >        goto expand_decl_rtl;
> >
> > -    case PARM_DECL:
> >      case VAR_DECL:
> > +      /* Allow accel compiler to handle variables that require special
> > +	 treatment, e.g. if they have been modified in some way earlier in
> > +	 compilation by the adjust_private_decl OpenACC hook.  */
> > +      if (flag_openacc && targetm.goacc.expand_var_decl)
> > +	{
> > +	  temp = targetm.goacc.expand_var_decl (exp);
> > +	  if (temp)
> > +	    return temp;
> > +	}
> > +      /* ... fall through ...  */
> > +
> > +    case PARM_DECL:
>
> [TS] Are we sure that we don't need the same handling for a
> 'PARM_DECL', too? (If yes, to document and verify that, should we
> thus again unify the two 'case's, and in
> 'targetm.goacc.expand_var_decl' add a 'gcc_checking_assert (TREE_CODE
> (var) == VAR_DECL)'?)
Maybe for routines? Those bits date from the earliest version of the
patch and (same excuse again) I didn't have call to revisit those
decisions.

> Also, are we sure that all the following existing processing is not
> relevant to do before the 'return temp' (see above)? That's not a
> concern for GCN (which doesn't use 'targetm.goacc.expand_var_decl',
> and thus does execute all this following existing processing), but it
> is for nvptx (which does use 'targetm.goacc.expand_var_decl', and
> thus doesn't execute all this following existing processing if that
> returned something). Or, is 'targetm.goacc.expand_var_decl'
> conceptually and practically meant to implement all of the following
> processing, or is this for other reasons not relevant in the
> 'targetm.goacc.expand_var_decl' case:
>
> >       /* If a static var's type was incomplete when the decl was
> >	   written, but the type is complete now, lay out the decl now.  */
> >       if (DECL_SIZE (exp) == 0
> |	    && COMPLETE_OR_UNBOUND_ARRAY_TYPE_P (TREE_TYPE (exp))
> |	    && (TREE_STATIC (exp) || DECL_EXTERNAL (exp)))
> |	  layout_decl (exp, 0);
> |
> |       /* fall through */
> |
> |     case FUNCTION_DECL:
> |     case RESULT_DECL:
> |       decl_rtl = DECL_RTL (exp);
> |     expand_decl_rtl:
> |       gcc_assert (decl_rtl);
> |
> |       /* DECL_MODE might change when TYPE_MODE depends on attribute target
> |	   settings for VECTOR_TYPE_P that might switch for the function.  */
> |       if (currently_expanding_to_rtl
> |	    && code == VAR_DECL && MEM_P (decl_rtl)
> |	    && VECTOR_TYPE_P (type) && exp && DECL_MODE (exp) != mode)
> |	  decl_rtl = change_address (decl_rtl, TYPE_MODE (type), 0);
> |       else
> |	  decl_rtl = copy_rtx (decl_rtl);
> |
> |       /* Record writes to register variables.  */
> |       if (modifier == EXPAND_WRITE
> |	    && REG_P (decl_rtl)
> |	    && HARD_REGISTER_P (decl_rtl))
> |	  add_to_hard_reg_set (&crtl->asm_clobbers,
> |			       GET_MODE (decl_rtl), REGNO (decl_rtl));
> |
> |       /* Ensure variable marked as used even if it doesn't go through
> |	   a parser.  If it hasn't be used yet, write out an external
> |	   definition.  */
> |       if (exp)
> |	  TREE_USED (exp) = 1;
> |
> |       /* Show we haven't gotten RTL for this yet.  */
> |       temp = 0;
> |
> |       /* Variables inherited from containing functions should have
> |	   been lowered by this point.  */
> |       if (exp)
> |	  context = decl_function_context (exp);
> |       gcc_assert (!exp
> |		    || SCOPE_FILE_SCOPE_P (context)
> |		    || context == current_function_decl
> |		    || TREE_STATIC (exp)
> |		    || DECL_EXTERNAL (exp)
> |		    /* ??? C++ creates functions that are not TREE_STATIC.  */
> |		    || TREE_CODE (exp) == FUNCTION_DECL);
> |
> |       /* This is the case of an array whose size is to be determined
> |	   from its initializer, while the initializer is still being parsed.
> |	   ??? We aren't parsing while expanding anymore.  */
> |
> |       if (MEM_P (decl_rtl) && REG_P (XEXP (decl_rtl, 0)))
> |	  temp = validize_mem (decl_rtl);
> |
> |       /* If DECL_RTL is memory, we are in the normal case and the
> |	   address is not valid, get the address into a register.  */
> |
> |       else if (MEM_P (decl_rtl) && modifier != EXPAND_INITIALIZER)
> |	  {
> |	    if (alt_rtl)
> |	      *alt_rtl = decl_rtl;
> |	    decl_rtl = use_anchored_address (decl_rtl);
> |	    if (modifier != EXPAND_CONST_ADDRESS
> |		&& modifier != EXPAND_SUM
> |		&& !memory_address_addr_space_p (exp ? DECL_MODE (exp)
> |						 : GET_MODE (decl_rtl),
> |						 XEXP (decl_rtl, 0),
> |						 MEM_ADDR_SPACE (decl_rtl)))
> |	      temp = replace_equiv_address (decl_rtl,
> |					    copy_rtx (XEXP (decl_rtl, 0)));
> |	  }
> |
> |       /* If we got something, return it.  But first, set the alignment
> |	   if the address is a register.  */
> |       if (temp != 0)
> |	  {
> |	    if (exp && MEM_P (temp) && REG_P (XEXP (temp, 0)))
> |	      mark_reg_pointer (XEXP (temp, 0), DECL_ALIGN (exp));
> |	  }
> |       else if (MEM_P (decl_rtl))
> |	  temp = decl_rtl;
> |
> |       if (temp != 0)
> |	  {
> |	    if (MEM_P (temp)
> |		&& modifier != EXPAND_WRITE
> |		&& modifier != EXPAND_MEMORY
> |		&& modifier != EXPAND_INITIALIZER
> |		&& modifier != EXPAND_CONST_ADDRESS
> |		&& modifier != EXPAND_SUM
> |		&& !inner_reference_p
> |		&& mode != BLKmode
> |		&& MEM_ALIGN (temp) < GET_MODE_ALIGNMENT (mode))
> |	      temp = expand_misaligned_mem_ref (temp, mode, unsignedp,
> |						MEM_ALIGN (temp), NULL_RTX, NULL);
> |
> |	    return temp;
> |	  }
> |     [...]
>
> [TS] I don't understand that yet. :-|
>
> Instead of the current "early-return" handling:
>
>       temp = targetm.goacc.expand_var_decl (exp);
>       if (temp)
>	  return temp;
>
> ... should we maybe just set:
>
>       DECL_RTL (exp) = targetm.goacc.expand_var_decl (exp)
>
> ... (or similar), and then let the usual processing continue?

Hum, not sure about that. See above excuse... maybe Chung-Lin
remembers? My guess is that the extra processing doesn't matter in
practice for the limited kinds of variables that are handled by that
hook, at least for NVPTX (which skips register allocation, etc.,
anyway).

> > [snip]
> >        tree fork_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK);
> >        tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
> > @@ -8027,7 +8041,8 @@ lower_oacc_head_tail (location_t loc, tree clauses,
> >					   &join_seq);
> >
> >        lower_oacc_reductions (loc, clauses, place, inner,
> > -			     fork, join, &fork_seq, &join_seq, ctx);
> > +			     fork, (count == 1) ? private_marker : NULL,
> > +			     join, &fork_seq, &join_seq, ctx);
> >
> >        /* Append this level to head.  */
> >        gimple_seq_add_seq (head, fork_seq);
>
> [TS] That looks good in principle.
> Via the testing mentioned above, I just want to make sure that this
> does all the expected things regarding differently nested loops and
> privatization levels.

Feel free to extend test coverage as you see fit...

> >        gimple_seq_add_seq (&new_body, fork_seq);
> > @@ -13262,6 +13369,9 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx)
> >				     ctx);
> >	   break;
> >	 case GIMPLE_BIND:
> > +	   if (ctx && is_gimple_omp_oacc (ctx->stmt))
> > +	     oacc_record_vars_in_bind (ctx,
> > +				       gimple_bind_vars (as_a <gbind *> (stmt)));
> >	   lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx);
> >	   maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt));
> >	   break;
>
> [TS] I have not yet verified whether these lowering cases are
> sufficient to also handle the <https://gcc.gnu.org/PR90114>
> "Predetermined private levels for variables declared in OpenACC
> accelerator routines" case. (If yes, then that needs testcases, too;
> if not, then need to add a TODO note, for later.)

I believe that's a TODO.

> > +     1. They can be recreated, making a pointer to the variable in the new
> > +	  address space, or
> > +
> > +     2. The address of the variable in the new address space can be taken,
> > +	  converted to the default (original) address space, and the result of
> > +	  that conversion substituted in place of the original ADDR_EXPR node.
> > +
> > +   Which of these is done depends on the gimple statement being processed.
> > +   At present atomic operations and inline asms use (1), and everything else
> > +   uses (2). At least on AMD GCN, there are atomic operations that work
> > +   directly in the LDS address space.
> > +
> > +   COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also rewritten to use
> > +   the new decl, adjusting types of appropriate tree nodes as necessary.  */
>
> [TS] As I understand, this is only relevant for GCN offloading, but
> not nvptx, and I'll trust that these two variants make sense from a
> GCN point of view (which I cannot verify easily).

The idea (hope) is that that's what's necessary "generically", though
the only target using that support is GCN at present. I.e. it's not
supposed to be GCN-specific, necessarily. Of course though, who knows
what some other exotic target will need? (We don't want to be in the
state where each target has to start completely from scratch for this
sort of thing, if we can help it.)

> > +  if (targetm.goacc.adjust_private_decl)
> > +    {
> > +      FOR_ALL_BB_FN (bb, cfun)
> > +	 for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
> > +	      !gsi_end_p (gsi);
> > +	      gsi_next (&gsi))
> > +	   {
> > +	     gimple *stmt = gsi_stmt (gsi);
> > +	     walk_stmt_info wi;
> > +	     var_decl_rewrite_info info;
> > +
> > +	     info.avoid_pointer_conversion
> > +	       = (is_gimple_call (stmt)
> > +		  && is_sync_builtin_call (as_a <gcall *> (stmt)))
> > +		 || gimple_code (stmt) == GIMPLE_ASM;
> > +	     info.stmt = stmt;
> > +	     info.modified = false;
> > +	     info.adjusted_vars = &adjusted_vars;
> > +
> > +	     memset (&wi, 0, sizeof (wi));
> > +	     wi.info = &info;
> > +
> > +	     walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi);
> > +
> > +	     if (info.modified)
> > +	       update_stmt (stmt);
> > +	   }
> > +    }
> > +
> >    free_oacc_loop (loops);
> >
> >    return 0;
>
> [TS] As discussed above, maybe we can completely skip the
> 'adjusted_vars' rewriting for nvptx offloading?

Yeah, sure, if you like.

> > --- /dev/null
> > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
>
> [TS] Without any code changes, this one FAILs (as expected) with nvptx
> offloading, but with GCN offloading, it already PASSes.

Not sure about that; of course one gets lucky sometimes.

> > --- /dev/null
> > +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90
>
> [TS] With code changes as posted, this one FAILs for nvptx offloading
> execution. (... for all but the Nvidia Titan V GPU in my set of
> testing configurations, huh?)
>
> > @@ -0,0 +1,25 @@
> > +! Test for worker-private variables
> > +
> > +! { dg-do run }
> > +! { dg-additional-options "-fdump-tree-oaccdevlow-details" }
> > +
> > +program main
> > +  integer :: w, arr(0:31)
> > +
> > +  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
> > +  !$acc loop gang worker private(w)
> > +! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has worker partitioning: integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */
> > +  do j = 0, 31
> > +    w = 0
> > +    !$acc loop seq
> > +    do i = 0, 31
> > +      !$acc atomic update
> > +      w = w + 1
> > +      !$acc end atomic
> > +    end do
> > +    arr(j) = w
> > +  end do
> > +  !$acc end parallel
> > +
> > +  if (any (arr .ne. 32)) stop 1
> > +end program main

Boo. I don't think I saw such a failure on the systems I tested on.
That needs investigation (though it might be something CUDA-version or
GPU specific, hence not directly a GCC problem? Not sure.)

Thanks for the review, and please ask if there's anything I can help
further with.

Julian