Hi Chung-Lin!

Just a few quick comments:

On 2021-08-02T21:10:57+0800, Chung-Lin Tang <clt...@codesourcery.com> wrote:
> On 2021/7/23 6:39 PM, Jakub Jelinek wrote:
>> On Fri, Jul 23, 2021 at 06:21:41PM +0800, Chung-Lin Tang wrote:
>>> --- a/libgomp/icv-device.c
>>> +++ b/libgomp/icv-device.c
>>> @@ -61,8 +61,17 @@ omp_is_initial_device (void)
>>>     return 1;
>>>   }
>>>
>>> +int
>>> +omp_get_device_num (void)
>>> +{
>>> +  /* By specification, this is equivalent to omp_get_initial_device
>>> +     on the host.  */
>>> +  return omp_get_initial_device ();
>>> +}
>>> +
>>
>> I think this won't work properly with the intel micoffload, where the host
>> libgomp is used in the offloaded code.
>> For omp_is_initial_device, the plugin solves it by:
>> liboffloadmic/plugin/offload_target_main.cpp
>> overriding it:
>> /* Override the corresponding functions from libgomp.  */
>> extern "C" int
>> omp_is_initial_device (void) __GOMP_NOTHROW
>> {
>>    return 0;
>> }
>>
>> extern "C" int32_t
>> omp_is_initial_device_ (void)
>> {
>>    return omp_is_initial_device ();
>> }
>> but guess it will need slightly more work because we need to copy the value
>> to the offloading device too.
>> It can be done incrementally though.
>
> I guess this part of intelmic functionality will just have to wait later.
> There seem to be other parts of liboffloadmic that seems to need re-work,
> e.g. omp_get_num_devices() return mic_engines_total, where it should actually
> return the number of all devices (not just intelmic). omp_get_initial_device()
> returning -1 (which I don't quite understand), etc.

(I'm confirming there are such pre-existing problems with Intel MIC; I've
never looked up any details.)

> Really suggest to have intelmic support be re-worked as an offload plugin 
> inside
> libgomp, rather than floating outside by itself.

Well, it is a regular libgomp plugin, just its sources are not in
'libgomp/plugin/' and it's not built during libgomp build.  Are you
suggesting just to move it into 'libgomp/plugin/'?  This may need some
more complicated setup because of its 'liboffloadmic' dependency?


>>> --- a/libgomp/libgomp-plugin.h
>>> +++ b/libgomp/libgomp-plugin.h
>>> @@ -102,6 +102,12 @@ struct addr_pair
>>>     uintptr_t end;
>>>   };
>>>
>>> +/* This symbol is to name a target side variable that holds the designated
>>> +   'device number' of the target device. The symbol needs to be available 
>>> to
>>> +   libgomp code and the  offload plugin (which in the latter case must be
>>> +   stringified).  */
>>> +#define GOMP_DEVICE_NUM_VAR __gomp_device_num
>>
>> For a single var it is acceptable (though, please avoid the double space
>> before offload plugin in the comment), but once we have more than one
>> variable, I think we should simply have a struct which will contain all the
>> parameters that need to be copied from the host to the offloading device at
>> image load time (and have eventually another struct that holds parameters
>> that we'll need to copy to the device on each kernel launch, I bet some ICVs
>> will be one category, other ICVs another one).

ACK.  Also other program state, like 'fenv' or the gfortran "state blob".
This is <https://gcc.gnu.org/PR92827> "Missing data/state
sharing/propagation between host and offloading devices".

> Actually, if you look at the 5.[01] specifications, omp_get_device_num() is 
> not
> defined in terms of an ICV. Maybe it conceptually ought to be, but the current
> description of "the device number of the device on which the calling thread is
> executing" is not one if the defined ICVs.
>
> It looks like there will eventually be some kind of ICV block handled in a 
> similar
> way, but I think that the modifications will be straightforward then. For now,
> I think it's okay for GOMP_DEVICE_NUM_VAR to just be a normal global variable.

There is, by the way, precedent for that:
'libgomp/config/nvptx/time.c:double __nvptx_clocktick', set up in
'libgomp/plugin/plugin-nvptx.c:nvptx_set_clocktick' ('cuModuleGetGlobal'
to get the device address, followed by 'cuMemcpyHtoD'), invoked from
'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_load_image', quite simple.

For the case discussed here, we're now adding more complex
'other_count'/'other_entries'/'num_others' bookkeeping.  (Great that all
of the plugins plus 'libgomp/target.c' invented their own terminology...)
;-)

> --- a/libgomp/plugin/plugin-gcn.c
> +++ b/libgomp/plugin/plugin-gcn.c

> @@ -3305,6 +3306,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, 
> const void *target_data,
>    struct kernel_info *kernel;
>    int kernel_count = image_desc->kernel_count;
>    unsigned var_count = image_desc->global_variable_count;
> +  int other_count = 1;
>
>    agent = get_agent_info (ord);
>    if (!agent)
> @@ -3321,7 +3323,8 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, 
> const void *target_data,
>
>    GCN_DEBUG ("Encountered %d kernels in an image\n", kernel_count);
>    GCN_DEBUG ("Encountered %u global variables in an image\n", var_count);
> -  pair = GOMP_PLUGIN_malloc ((kernel_count + var_count - 2)
> +  GCN_DEBUG ("Expect %d other variables in an image\n", other_count);
> +  pair = GOMP_PLUGIN_malloc ((kernel_count + var_count + other_count - 2)
>                            * sizeof (struct addr_pair));
>    *target_table = pair;
>    module = (struct module_info *)
> @@ -3396,6 +3399,35 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, 
> const void *target_data,
>        pair++;
>      }
>
> +  GCN_DEBUG ("Looking for variable %s\n", STRINGX (GOMP_DEVICE_NUM_VAR));
> +
> +  hsa_status_t status;
> +  hsa_executable_symbol_t var_symbol;
> +  status = hsa_fns.hsa_executable_get_symbol_fn (agent->executable, NULL,
> +                                              STRINGX (GOMP_DEVICE_NUM_VAR),
> +                                              agent->id, 0, &var_symbol);
> +  if (status == HSA_STATUS_SUCCESS)
> +    {
> +      uint64_t device_num_varptr;
> +      uint32_t device_num_varsize;
> +
> +      status = hsa_fns.hsa_executable_symbol_get_info_fn
> +     (var_symbol, HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS,
> +      &device_num_varptr);
> +      if (status != HSA_STATUS_SUCCESS)
> +     hsa_fatal ("Could not extract a variable from its symbol", status);
> +      status = hsa_fns.hsa_executable_symbol_get_info_fn
> +     (var_symbol, HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_SIZE,
> +      &device_num_varsize);
> +      if (status != HSA_STATUS_SUCCESS)
> +     hsa_fatal ("Could not extract a variable size from its symbol", status);
> +
> +      pair->start = device_num_varptr;
> +      pair->end = device_num_varptr + device_num_varsize;
> +    }
> +  else
> +    pair->start = pair->end = 0;
> +

Is the 'else' branch to accomodate "old" executables running against
"new" libgomp?  If yes, then please add a comment, "for compatibility
with pre-GCC 12 executables" or similar.

Also, add 'pair++;', to avoid future confusion?

>    /* Ensure that constructors are run first.  */
>    struct GOMP_kernel_launch_attributes kla =
>      { 3,
> @@ -3418,7 +3450,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, 
> const void *target_data,
>    if (module->fini_array_func)
>      kernel_count--;
>
> -  return kernel_count + var_count;
> +  return kernel_count + var_count + other_count;
>  }
>
>  /* Unload GCN object-code module described by struct gcn_image_desc in

> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c

> @@ -1265,7 +1266,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, 
> const void *target_data,
>    CUmodule module;
>    const char *const *var_names;
>    const struct targ_fn_launch *fn_descs;
> -  unsigned int fn_entries, var_entries, i, j;
> +  unsigned int fn_entries, var_entries, other_entries, i, j;
>    struct targ_fn_descriptor *targ_fns;
>    struct addr_pair *targ_tbl;
>    const nvptx_tdata_t *img_header = (const nvptx_tdata_t *) target_data;
> @@ -1295,8 +1296,11 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, 
> const void *target_data,
>    fn_entries = img_header->fn_num;
>    fn_descs = img_header->fn_descs;
>
> +  /* Currently, the only other entry kind is 'device number'.  */
> +  other_entries = 1;
> +
>    targ_tbl = GOMP_PLUGIN_malloc (sizeof (struct addr_pair)
> -                              * (fn_entries + var_entries));
> +                              * (fn_entries + var_entries + other_entries));
>    targ_fns = GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor)
>                                * fn_entries);
>
> @@ -1345,9 +1349,22 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, 
> const void *target_data,
>        targ_tbl->end = targ_tbl->start + bytes;
>      }
>
> +  CUdeviceptr device_num_varptr;
> +  size_t device_num_varsize;
> +  CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &device_num_varptr,
> +                               &device_num_varsize, module,
> +                               STRINGX (GOMP_DEVICE_NUM_VAR));
> +  if (r == CUDA_SUCCESS)
> +    {
> +      targ_tbl->start = (uintptr_t) device_num_varptr;
> +      targ_tbl->end = (uintptr_t) (device_num_varptr + device_num_varsize);
> +    }
> +  else
> +    targ_tbl->start = targ_tbl->end = 0;
> +

Same comment for 'else', please.

Also, 'targ_tbl++;', to avoid future confusion?

>    nvptx_set_clocktick (module, dev);
>
> -  return fn_entries + var_entries;
> +  return fn_entries + var_entries + other_entries;
>  }
>
>  /* Unload the program described by TARGET_DATA.  DEV_DATA is the

> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -1974,6 +1974,9 @@ gomp_load_image_to_device (struct gomp_device_descr 
> *devicep, unsigned version,
>    int num_funcs = host_funcs_end - host_func_table;
>    int num_vars  = (host_vars_end - host_var_table) / 2;
>
> +  /* Others currently is only 'device_num' */
> +  int num_others = 1;
> +
>    /* Load image to device and get target addresses for the image.  */
>    struct addr_pair *target_table = NULL;
>    int i, num_target_entries;
> @@ -1982,7 +1985,9 @@ gomp_load_image_to_device (struct gomp_device_descr 
> *devicep, unsigned version,
>      = devicep->load_image_func (devicep->target_id, version,
>                               target_data, &target_table);
>

Do I understand right that the special-casing here:

> -  if (num_target_entries != num_funcs + num_vars)
> +  if (num_target_entries != num_funcs + num_vars
> +      /* Others (device_num) are included as trailing entries in pair list.  
> */
> +      && num_target_entries != num_funcs + num_vars + num_others)
>      {
>        gomp_mutex_unlock (&devicep->lock);
>        if (is_register_lock)
> @@ -2054,6 +2059,31 @@ gomp_load_image_to_device (struct gomp_device_descr 
> *devicep, unsigned version,
>        array++;
>      }
>
> +  /* Last entry is for the on-device 'device_num' variable. Tolerate case
> +     where plugin does not return this entry.  */
> +  if (num_funcs + num_vars < num_target_entries)

... and here is just for compatibility with the unmodified Intel MIC
plugin?  Wouldn't it be simpler to just add a dummy value to that one, to
avoid this special-casing?

> +    {
> +      struct addr_pair *device_num_var = &target_table[num_funcs + num_vars];
> +      if (device_num_var->start != 0)

Please also here add a comment about the '0' case, for completeness.

> +     {
> +       /* The index of the devicep within devices[] is regarded as its
> +          'device number', which is different from the per-device type
> +          devicep->target_id.  */
> +       int device_num_val = (int) (devicep - &devices[0]);
> +       if (device_num_var->end - device_num_var->start != sizeof (int))
> +         {
> +           gomp_mutex_unlock (&devicep->lock);

Add:

    if (is_register_lock)
      gomp_mutex_unlock (&register_lock);

..., I suppose?

> +           gomp_fatal ("offload plugin managed 'device_num' not of expected "
> +                       "format");
> +         }
> +
> +       /* Copy device_num value to place on device memory, hereby actually
> +          designating its device number into effect.  */
> +       gomp_copy_host2dev (devicep, NULL, (void *) device_num_var->start,
> +                           &device_num_val, sizeof (int), false, NULL);
> +     }
> +    }
> +
>    free (target_table);
>  }

Am I understanding this correctly that instead of handling it all inside
the plugins (like '__nvptx_clocktick'), we're here doing the
'other_count'/'other_entries'/'num_others' bookkeeping plus
'gomp_copy_host2dev', because we don't know 'device_num_val' inside the
plugins, and also to introduce a more generic interface for future ICVs
etc.?

It feels to me as if this interface can be improved still.  (For example,
would it make sense to pass into the plugins (via 'load_image_func') some
kind of table, containing 'device_num_val' etc., and then let each plugin
to the setup?)  But I'm OK if we leave that for later, once we actually
add more ICVs etc.


> --- a/libgomp/config/gcn/icv-device.c
> +++ b/libgomp/config/gcn/icv-device.c
> @@ -70,6 +70,16 @@ omp_is_initial_device (void)
>    return 0;
>  }
>
> +/* This is set to the device number of current GPU during device 
> initialization,
> +   when the offload image containing this libgomp portion is loaded.  */
> +static int GOMP_DEVICE_NUM_VAR;
> +
> +int
> +omp_get_device_num (void)
> +{
> +  return GOMP_DEVICE_NUM_VAR;
> +}
> +
>  ialias (omp_set_default_device)
>  ialias (omp_get_default_device)
>  ialias (omp_get_initial_device)

I suppose also add 'ialias (omp_get_device_num)' here, like...

> --- a/libgomp/config/nvptx/icv-device.c
> +++ b/libgomp/config/nvptx/icv-device.c
> @@ -58,8 +58,19 @@ omp_is_initial_device (void)
>    return 0;
>  }
>
> +/* This is set to the device number of current GPU during device 
> initialization,
> +   when the offload image containing this libgomp portion is loaded.  */
> +static int GOMP_DEVICE_NUM_VAR;
> +
> +int
> +omp_get_device_num (void)
> +{
> +  return GOMP_DEVICE_NUM_VAR;
> +}
> +
>  ialias (omp_set_default_device)
>  ialias (omp_get_default_device)
>  ialias (omp_get_initial_device)
>  ialias (omp_get_num_devices)
>  ialias (omp_is_initial_device)
> +ialias (omp_get_device_num)

... here, and...

> --- a/libgomp/icv-device.c
> +++ b/libgomp/icv-device.c
> @@ -61,8 +61,17 @@ omp_is_initial_device (void)
>    return 1;
>  }
>
> +int
> +omp_get_device_num (void)
> +{
> +  /* By specification, this is equivalent to omp_get_initial_device
> +     on the host.  */
> +  return omp_get_initial_device ();
> +}
> +
>  ialias (omp_set_default_device)
>  ialias (omp_get_default_device)
>  ialias (omp_get_initial_device)
>  ialias (omp_get_num_devices)
>  ialias (omp_is_initial_device)
> +ialias (omp_get_device_num)

... here?


> --- a/libgomp/testsuite/lib/libgomp.exp
> +++ b/libgomp/testsuite/lib/libgomp.exp

> +# Return 1 if compiling for offload target intelmic
> +proc check_effective_target_offload_target_intelmic { } {
> +    return [libgomp_check_effective_target_offload_target "*-intelmic"]
> +}

> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.c-c++-common/target-45.c
> @@ -0,0 +1,30 @@
> +/* { dg-do run { target { ! offload_target_intelmic } } } */

This means that the test case is skipped as soon as the compiler is
configured for Intel MIC offloading -- even if that's not used during
execution.

>From some older experiment of mine, I do have a
'check_effective_target_offload_device_intel_mic', which I'll propose as
a follow-up, once this is in.

> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.fortran/target10.f90
> @@ -0,0 +1,20 @@
> +! { dg-do run { target { ! offload_target_intelmic } } }

Likewise.

> +program main
> +  use omp_lib
> +  implicit none
> +  integer :: device_num, host_device_num
> +  logical :: initial_device
> +
> +  host_device_num = omp_get_device_num ()
> +  if (host_device_num .ne. omp_get_initial_device ()) stop 1
> +
> +  !$omp target map(from: device_num, initial_device)
> +  initial_device = omp_is_initial_device ()
> +  device_num = omp_get_device_num ()
> +  !$omp end target
> +
> +  if (initial_device .and. host_device_num .ne. device_num) stop 2

That one matches 'libgomp.c-c++-common/target-45.c':

    if (initial_device && host_device_num != device_num)
      abort ();

..., but here:

> +  if (initial_device .and. host_device_num .eq. device_num) stop 3

... shouldn't that be '.not.initial_device', like in:

    if (!initial_device && host_device_num == device_num)
      abort ();


(Also, I'm not familiar with Fortran operator precedence rules, so
probably would put the individual expressions into braces.  ;-) -- But I
trust you know better than I do, of course.)


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955

Reply via email to