> No need for the device and handler IMHO, each vector would correspond to
> one function call (GOMP_target, GOMP_target_data or GOMP_target_update)
> and all those calls would be called with device id.
Probably yes.

> Let's talk about some concrete example (though, I see the gimplifier
> doesn't handle it right and with #if 0 changed into #if 1 we ICE in the C
> FE, ++todo).
That's a great idea.  I wanted to go into an example too, but I'd take a
simpler case to begin with.

Here is a modified version of your test - I just removed all the pragmas that
don't deal with offloading.  They aren't new in OpenMP 4.0, and GCC already
supports them well, so we can concentrate on the others - like 'pragma omp
target' and 'pragma omp target data'.

So, here is the code we start from:

  #pragma omp declare target
  int v = 6;
  int tgt ()
  {
    #pragma omp atomic update
      v++;
    return 0;
  }
  #pragma omp end declare target

  float
  bar (int x, int y, int z)
  {
    float b[1024], c[1024], s = 0;
    int i, j;
    baz (b, c, x);
    #pragma omp target data map(to: b)
    {
      #pragma omp target map(tofrom: c) map(from:s)
      for (i = 0; i < 1024; i++)
        tgt (), s += b[i] * c[i];
      #pragma omp target update from(b, v)
    }
    return s;
  }

Let's write down what we want this to be expanded to.  For now let's ignore
the obvious problems leading to ICEs that you mentioned - they certainly need
to be addressed, but I don't think they affect the overall design we're
discussing here.

As I currently see it, the given code would be expanded to something like
this:

  // Create two versions of V: one for the host and one for the target
  int v;
  int v_target __attribute__((target));

  // The same for the TGT function
  int tgt ()
  {
    .. update v ..
  }
  int tgt_target () __attribute__((target))
  {
    .. update v_target ..
  }

  float
  bar (int x, int y, int z)
  {
    float b[1024], c[1024], s = 0;
    int i, j;
    baz (b, c, x);
    // #pragma omp target data map(to: b)
    vec<data_descriptor> data_desc;
    data_desc.push ({&b, 1024*sizeof(float), TO});
    GOMP_target_data (&data_desc);
    {
      // #pragma omp target map(tofrom: c) map(from:s)
      data_desc.push ({&c, 1024*sizeof(float), TOFROM});
      data_desc.push ({&s, sizeof(float), FROM});
      GOMP_target_data (&data_desc); // Add mapping for S and C variables,
                                     // mapping for B shouldn't change
      GOMP_target (foo1, "foo1", &data_desc); // Call either FOO1 or offloaded
                                              // FOO1_TARGET with arguments
                                              // from vector DATA_DESC

      // #pragma omp target update from(b, v)
      vec<data_descriptor> data_desc_update; // The target update pragma
                                             // requires a separate vector
      data_desc_update.push ({&b, 1024*sizeof(float), FROM});
      data_desc_update.push ({&v, sizeof(int), FROM});
      GOMP_target_data (&data_desc_update);
    }
    return s;
  }
  void
  foo1 (vec<data_descriptor> data_desc)
  {
    float *b = (float *) data_desc[0].host_address;
    float *c = (float *) data_desc[1].host_address;
    float s = 0;
    int i;
    for (i = 0; i < 1024; i++)
      tgt (), s += b[i] * c[i];
    *(float *) data_desc[2].host_address = s;
  }
  void
  foo1_target (int n, void **arguments) __attribute__((target))
  {
    float *b = (float *) arguments[0];
    float *c = (float *) arguments[1];
    float s = 0;
    int i;
    for (i = 0; i < 1024; i++)
      tgt_target (), s += b[i] * c[i];
    *(float *) arguments[2] = s;
  }

That's how I think the code should look after omp-expansion.  I.e. all
variables and functions marked with 'declare target' are cloned so that we
have host and target versions available (if we have N different targets, then
we need N+1 versions).  All regions corresponding to 'pragma omp target' are
outlined and also cloned into N+1 versions.  'pragma omp target data' is
replaced with generation of a vector of data descriptors and a call to
GOMP_target_data, which performs the actual data mapping and marshalling.  We
could also call GOMP_target_data to invoke additional data mapping/marshalling
- e.g. we need this for 'pragma omp target' where we add the clause
MAP(TOFROM:C).  'pragma omp target update' needs a separate vector, as its
mapping could differ from the mapping of the enclosing 'pragma omp target
data'.
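
For concreteness, here is a minimal sketch of what I mean by a data
descriptor - essentially the { mapkind, hostaddr, length } triple you
described (the exact field and enum names are, of course, up for discussion):

  /* Kind of mapping requested by a MAP clause.  */
  enum map_kind { TO, FROM, TOFROM };

  /* One element of the vector passed to GOMP_target{,_data}.  */
  struct data_descriptor
  {
    void *host_address;      /* Address of the object on the host.  */
    size_t length;           /* Size of the mapped block in bytes.  */
    enum map_kind map_kind;  /* TO / FROM / TOFROM.  */
  };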

But all that would be rather meaningless without discussing the internals of
GOMP_target{_data}.  So, let's proceed to the libgomp part and discuss what
these functions should do.  From my POV, they would perform the following
(see the sketches after this list):
1) GOMP_target_data:
For each element of the input vector, check whether it already has a mapping,
and if not, create a new one.  Also, the necessary marshalling is triggered
from here (though it's probably better to move the data transfer to a separate
routine).
The result of this function would be a consistent data structure containing
all mapped memory entries as well as their handlers, representing target-side
addresses.
2) GOMP_target:
First of all, this function would call gomp_choose_device_for_offload, which
would check all available targets and choose one for offloading.  The host
could also be chosen here.
If the host is chosen, we just call the host version of the routine (the
function address is passed via the first argument) and pass the
data_descriptor vector to it.
If a target device is chosen, we do the following: create a vector of handlers
corresponding to the data descriptors from the input vector, then pass the
routine name as well as the vector of handlers to the function
gomp_run_offloaded_function from the target plugin.  That routine performs the
actual offloading, waits for completion, and returns.
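
Here is a rough sketch of GOMP_target_data under this scheme.  The helpers
gomp_lookup_mapping, gomp_map_new and gomp_copy_to_device are hypothetical
names, just to show the control flow:

  /* Hypothetical internal helpers (names are placeholders):  */
  struct mapping;
  extern struct mapping *gomp_lookup_mapping (void *, size_t);
  extern struct mapping *gomp_map_new (void *, size_t);
  extern void gomp_copy_to_device (struct mapping *);

  void
  GOMP_target_data (struct data_descriptor *desc, size_t n)
  {
    size_t i;
    for (i = 0; i < n; i++)
      {
        /* Look the host address up in the device's mapping table.  */
        struct mapping *m = gomp_lookup_mapping (desc[i].host_address,
                                                 desc[i].length);
        if (m == NULL)
          {
            /* Not mapped yet: allocate device memory and remember the
               mapping together with its handler.  */
            m = gomp_map_new (desc[i].host_address, desc[i].length);
            /* Marshalling: copy host data in for TO/TOFROM mappings.  */
            if (desc[i].map_kind == TO || desc[i].map_kind == TOFROM)
              gomp_copy_to_device (m);
          }
      }
  }

And similarly for GOMP_target - here gomp_get_handler is again a hypothetical
lookup, while gomp_choose_device_for_offload and gomp_run_offloaded_function
are the routines mentioned above:

  #include <stdlib.h>

  /* More placeholders:  */
  #define GOMP_DEVICE_HOST -1
  extern int gomp_choose_device_for_offload (void);
  extern void *gomp_get_handler (int, struct data_descriptor *);
  extern void gomp_run_offloaded_function (int, const char *, size_t, void **);

  void
  GOMP_target (void (*host_fn) (struct data_descriptor *, size_t),
               const char *fn_name, struct data_descriptor *desc, size_t n)
  {
    size_t i;
    void **handlers;
    int device = gomp_choose_device_for_offload ();
    if (device == GOMP_DEVICE_HOST)
      {
        /* Fall back to the host version of the outlined region.  */
        host_fn (desc, n);
        return;
      }
    /* Map/copy everything that isn't mapped yet, exactly as
       GOMP_target_data does, then build the vector of target-side
       handlers for the outlined function's arguments.  */
    GOMP_target_data (desc, n);
    handlers = malloc (n * sizeof (void *));
    for (i = 0; i < n; i++)
      handlers[i] = gomp_get_handler (device, &desc[i]);
    /* The plugin performs the actual offloading and waits for completion.  */
    gomp_run_offloaded_function (device, fn_name, n, handlers);
    free (handlers);
  }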

Does this overall scheme sound ok to you?

> Now, for GOMP_target we want omplower to replace the var references
> like b or c with something like .omp_target_data->b, .omp_target_data->c
> etc., where the structure will contain the target addresses of the
> variables.  So, GOMP_target would again receive vector of the
> { mapkind, hostaddr, length }, do the lookups, allocations / copying
> like for GOMP_target_data, but also prepare a vector of the corresponding
> target addresses that it would pass to the target function.
Agreed.  I tried to describe and rephrase that a bit above to make sure we
both mean the same thing here.
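
Just to double-check with the example above, I'd expect .omp_target_data to
look conceptually like this (the layout is hypothetical, built from the map
clauses):

  /* One target address per variable mapped by the 'pragma omp target'.  */
  struct omp_target_data
  {
    float *b;  /* Target address of B (from the enclosing target data).  */
    float *c;  /* Target address of C (map(tofrom:c)).  */
    float *s;  /* Target address of S (map(from:s)).  */
  };
  /* In the outlined body, references to b[i] become .omp_target_data->b[i],
     and likewise for c and s.  */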

> Automatic variables defined in the scope of #pragma omp target body
> don't need any special treatment (but I hope gimplifier doesn't do anything
> for them), they will be just automatic variables inside of the target
> outlined body.
I hope that too.

> But let's start with non-optimized code,
> everything is passed as target address of the allocated spot.
Agreed.

> As specs are target specific, I'm afraid you'll need to be looking for
> the gcc driver for the target, not lto1 binary.
I'm afraid I didn't get this.  Could you explain this point?  What specs do
you mean here?

> Configure could record the names, or you could scan a directory with the
> plugins and dlopen all shared libraries in there, ...
I'd prefer recording them at the configure step - to me it looks more robust;
but in general I'm ok with both options.
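
For comparison, the directory-scanning variant would be roughly this (where
the plugin directory and gomp_register_plugin are placeholders for whatever
configure/libgomp would actually provide):

  #include <dirent.h>
  #include <dlfcn.h>
  #include <stdio.h>
  #include <string.h>

  /* Hypothetical hook that queries a plugin for available devices.  */
  extern void gomp_register_plugin (void *handle);

  static void
  gomp_find_plugins (const char *plugin_dir)
  {
    DIR *dir = opendir (plugin_dir);
    struct dirent *e;
    if (dir == NULL)
      return;
    while ((e = readdir (dir)) != NULL)
      if (strstr (e->d_name, ".so") != NULL)
        {
          char path[1024];
          void *h;
          snprintf (path, sizeof path, "%s/%s", plugin_dir, e->d_name);
          h = dlopen (path, RTLD_LAZY);
          if (h != NULL)
            gomp_register_plugin (h);
        }
    closedir (dir);
  }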

---
Thanks, Michael



On 23 Aug 18:16, Jakub Jelinek wrote:
> On Fri, Aug 23, 2013 at 07:30:52PM +0400, Michael V. Zolotukhin wrote:
> > That makes sense.  We could maintain a vector of descriptors for each
> > encountered MAP clause and push to and pop from it when needed (when
> > e.g. new mapping is encountered inside 'pragma omp target data').  The
> > descriptor should contain address in the host memory, size of the mapped
> > block, type of mapping, related device, and handler, which would be
> > returned for this mapping by runtime.  Having vector of such
> > descriptors, we could pass it as an argument for outlined functions - in
> > them we need to extract needed addresses from the vector before
> > executing the body.  Did I get it right?
> 
> No need for the device and handler IMHO, each vector would correspond to
> one function call (GOMP_target, GOMP_target_data or GOMP_target_update)
> and all those calls would be called with device id.
> 
> > Also, a bit unclear point here is how should we generate these
> > extractions in target-version of the outlined function - seemingly we
> > won't pass this entire vector to it, so it's unclear out of what should
> > we extract the data.  What do you think on this?
> 
> Let's talk about some concrete example (though, I see the gimplifier
> doesn't handle it right and with #if 0 changed into #if 1 we ICE in the C
> FE, ++todo).
> 
> void baz (float *, float *, int);
> 
> #pragma omp declare target
> int v = 6;
> int tgt ()
> {
>   #pragma omp atomic update
>     v++;
>   return 0;
> }
> #pragma omp end declare target
> 
> float
> bar (int x, int y, int z)
> {
>   float b[1024], c[1024], s = 0;
>   int i, j;
>   baz (b, c, x);
>   #pragma omp target data map(to: b)
>   {
>     #pragma omp target map(tofrom: c)
> #if 0
>       #pragma omp teams num_teams(y) thread_limit(z) reduction(+:s)
>         #pragma omp distribute dist_schedule(static, 4) collapse(1)
>           for (j=0; j < x; j += y)
> #else
>         j = 0;
> #endif
>             #pragma omp parallel for reduction(+:s)
>               for (i = j; i < j + y; i++)
>                 tgt (), s += b[i] * c[i];
>     #pragma omp target update from(b, v)
>   }
>   return s;
> }
> 
> float
> foo (int x)
> {
>   float b[1024], c[1024], s = 0;
>   int i;
>   baz (b, c, x);        
>   #pragma omp target map(to: b, c)
>     #pragma omp parallel for reduction(+:s)
>       for (i = 0; i < x; i++)
>         tgt (), s += b[i] * c[i];
>   return s;
> }
> 
> This ICEs during ompexp right now otherwise and obviously even omplower
> doesn't DTRT.
> 
> So we have something like:
> 
>   #pragma omp target data map(to:b)
>   #pragma omp target map(tofrom:j)
>   j = 0;
>   #pragma omp parallel reduction(+:s) shared(j) shared(c) shared(b) shared(y) 
> [child fn: _Z3bariii._omp_fn.0 (???)]
>   #pragma omp for nowait private(i)
>   for (i = j; i < D.2235; i = i + 1)
>     {
>       tgt ();
>       D.2236 = b[i];
>       D.2237 = c[i];
>       D.2238 = D.2236 * D.2237;
>       s = D.2238 + s;
>     }
>   #pragma omp target update from(v) from(b)
> 
> On #pragma omp target it clearly is missing many other map clauses,
> like map(tofrom:s), map(tofrom:c), map(tofrom:y) at least, will need to
> debug later on why they disappeared or weren't added.
> 
> In any case, the only thing GOMP_target_data can do is take the vector
> of the map clauses { mapkind, hostaddr, length } and look them up
> one by one in the mapping of the device and if not present there, allocate
> and/or copy and remember.
> 
> Now, for GOMP_target we want omplower to replace the var references
> like b or c with something like .omp_target_data->b, .omp_target_data->c
> etc., where the structure will contain the target addresses of the
> variables.  So, GOMP_target would again receive vector of the
> { mapkind, hostaddr, length }, do the lookups, allocations / copying
> like for GOMP_target_data, but also prepare a vector of the corresponding
> target addresses that it would pass to the target function.
> 
> Automatic variables defined in the scope of #pragma omp target body
> don't need any special treatment (but I hope gimplifier doesn't do anything
> for them), they will be just automatic variables inside of the target
> outlined body.  Other automatic variables in the function containing
> #pragma omp target could have some optimization for them, if there aren't
> any #pragma
> omp target data directives referencing them around the #pragma omp target
> that references them, such variables are guaranteed not to be mapped
> in the target device upon GOMP_target call, thus such vars could be e.g.
> allocated in a flexible array at the end of the .omp_target_data
> structure.  Also for non-addressable variables supposedly we could consider
> promoting them into a temporary variable (at the start of GOMP_target
> body load them from .omp_target_data->something, at the end store them back
> (well, depending on map kind)).  But let's start with non-optimized code,
> everything is passed as target address of the allocated spot.
> 
> Also, GOMP_target{_data,} could just lookup addresses from the whole vector
> and remember what succeeded and what failed (i.e. what has been already
> mapped and thus noop and what needs mapping and depending on mapkind
> copying) and sum up the amount of memory that needs allocation for the
> latter ones, then just allocate in the device everything at once and just
> partition it for the individual vars.
> 
> > > I meant just a single plugin that would handle all of them, or as richi
> > > said, perhaps teach LTO plugin to do that.
> > > For options, my vision was something like:
> > > -ftarget=mic -ftarget=hsail='-mfoobaz=4 -mbazbaz'
> > > which would mean:
> > > 1) compile LTO IL from the accelerator section for mic with
> > >    the originally recorded gcc command line options with the Target
> > >    options removed and no extra options added
> > > 2) compile LTO IL also for hsail target, with originally recorded gcc
> > >    command line options but Target options and -mfoobaz=4 -mbazbaz
> > >    options added
> > > 3) don't compile for ptx
> > > The thing is if you originally compile with
> > > -O3 -ftree-vectorize -march=corei7-avx -minline-all-stringops
> > > the -m* options might not apply to the target compiler at all.
> > > So you'd construct the command line from the original command line sans
> > > CL_TARGET options, append to that the link time override for the
> > > accelerator.  Then another thing is how to find out the corresponding
> > > compiler (including its driver) for the target from the plugin.
> > Could we set some correspondence between '-ftarget' option value and
> > corresponding compiler?  E.g. for '-ftarget=xyz' we would look for
> > xyz-cc1.  I haven't looked in details at how the compiler plugins work,
> > so maybe I said something unfeasible :)
> 
> As specs are target specific, I'm afraid you'll need to be looking for
> the gcc driver for the target, not lto1 binary.
> 
> > > libgomp would start by trying to dlopen all available plugins,
> > > and for each of them call some routine in them that would query the hw
> > > for available devices, then libgomp would assign device ids to them (0 and
> > > up) and then for target specific parts just dispatch again to the plugin
> > > corresponding to the chosen device.
> > In libgomp we have a similar problem - there we would need to find out
> > plugins names from somewhere.  The difference is that libgomp would
> > always iterate through all plugins independently of compiler options,
> > but even with this I currently have no idea of how to populate list of
> > plugins names (I suppose, this should be done somewhere at
> > configure/make step of libgomp building process?).
> 
> Configure could record the names, or you could scan a directory with the
> plugins and dlopen all shared libraries in there, ...
> 
>       Jakub
