On Tue, May 17, 2022 at 11:57:02AM +0200, Marcel Vollweiler wrote:
> > More importantly, I have no idea how this can work when you pass arg_size 0
> > and arg_align 0.  The s variable is in the current function frame, with
> > arg_size 0 nothing is really copied to the generated task.
> > arg_size should be sizeof (memcpy_t) and arg_align __alignof__ (memcpy_t)
> > (well, struct omp_target_memcpy_data).
> 
> The copy function of GOMP_task ("cpyfn") is not used here (set to NULL) and 
> thus
> also arg_size and arg_align are set to 0 since they are related to cpyfn if I
> understand it correctly.

No, arg_size and arg_align are for all (explicit) tasks the size and
alignment of the arguments.  For an included task (one executed by the
encountering thread) we indeed use data directly instead of allocating
arg_size arg_align aligned bytes and copying data to it.  But when we create
a deferred task (that is the only thing that actually can be asynchronous), we
allocate struct gomp_task together with memory for the data (arg_size bytes
aligned to arg_align).  If cpyfn, we invoke that copy function (from source
data to the destination buffer), otherwise memcpy.  cpyfn is a callback that
will do memcpy for parts that need bitwise copy and copy construction /
whatever else is needed for other data.
Looking at your patch, you call GOMP_task always with if_clause = false,
that means it is always included task (like with #pragma omp task if(0)),
but that also means calling GOMP_task doesn't bring any advantages and it is
not asynchronous.
If you called it with if_clause = true, like what #pragma omp task would do,
then the arg_size = 0 and arg_align = 0 would make it not work at all,
so after fixing if_clause, you need to supply sizeof (s) and __alignof__ (s).

> > Also, it would be nice to avoid GOMP_task for the depobj_count == 0 case
> > at least sometimes (but perhaps that can be done incrementally) and instead
> > use some CUDA etc. asynchronous copy APIs.  We don't really need to wait
> > for anything in that case, and from OpenMP POV all we need to make sure is
> > that barrier/taskwait/taskgroup end will know about these "tasks" and
> > wait for them.  So, it can be implemented more like #pragma omp target 
> > nowait
> > instead of #pragma omp task that calls the synchronous omp_target_memcpy.
> > Though, maybe that is how it should be implemented always, something like
> > gomp_create_target_task and its caller.  We already use that single routine
> > for multiple purposes (target nowait as well as target enter/exit data
> > nowait), so just telling it somehow that it shouldn't do mapping/unmapping
> > and perhaps target execution and instead copying would be nice.
> 
> I dont't see/understand the advantage using gomp_create_target_task over
> GOMP_task. Whether the task waits for dependencies
> ("gomp_task_maybe_wait_for_dependencies") depends on GOMP_TASK_FLAG_DEPEND 
> which
> is only set if depobj_count > 0 and depobj_list != NULL. Thus, there shouldn't
> be any waiting in case of depobj_count == 0? Additionally, in both functions a
> new thread is created - independently of dependencies.

GOMP_task never creates a new thread.
gomp_create_target_task can create (but just once) an unshackeled thread
that runs on the side, doesn't do normal OpenMP user work and just polls the
offloading device and performs unmapping or whatever is needed to finish a
nowait offloaded task.

The disadvantage of GOMP_task is:
1) if you call say omp_target_memcpy_async from outside of parallel, it will
   not be actually asynchronous even if you call GOMP_task with if_clause = true
2) if you call it from inside of parallel, it might be scheduled only when
   some host thread is ready for work (e.g. when reaching #pragma omp barrier,
   implicit barrier, #pragma omp taskwait etc.), so even when the offloading
   device is unused but host has lots of work to do, it might take quite a
   while before starting the work, and then one of the OpenMP host threads
   will be blocked waiting for the copying to be done

gomp_create_target_task doesn't have these disadvantages, it can fire off the
copying right away and then just needs to be able to figure out when it
finished (either the unshackeled thread polls the device, or some other way
how to find out that it finished; but OpenMP certainly needs to know that,
because user code can say #pragma omp taskwait for it, or it should be
complete at the end of a taskgroup, or at the end of #pragma omp barrier
or implicit barrier etc.).

Anyway, I guess it is ok to use GOMP_task in the initial patch and change it
later, but if_clause = false and 0, 0 for arg_{size,align} are definitely
wrong.

> +int
> +omp_target_memcpy (void *dst, const void *src, size_t length, size_t 
> dst_offset,
> +                size_t src_offset, int dst_device_num, int src_device_num)
> +{
> +  struct gomp_device_descr *dst_devicep = NULL, *src_devicep = NULL;
> +  int ret;
> +
> +  ret = omp_target_memcpy_check (dst_device_num, src_device_num, 
> &dst_devicep,
> +                              &src_devicep);

You can just use
  int ret = omp_target_memcpy_check (dst_device_num, src_device_num,
                                     &dst_devicep, &src_devicep);

> +int
> +omp_target_memcpy_async (void *dst, const void *src, size_t length,
> +                      size_t dst_offset, size_t src_offset,
> +                      int dst_device_num, int src_device_num,
> +                      int depobj_count, omp_depend_t *depobj_list)
> +{
> +  struct gomp_device_descr *dst_devicep = NULL, *src_devicep = NULL;
> +  void (*fn) (void *) = &omp_target_memcpy_async_helper;

No need for the fn variable, just pass /*fn=*/omp_target_memcpy_async_helper
as the first argument to GOMP_task.

> +  unsigned int flags = 0;
> +  void *data;

No need for the data variable.

> +  void *depend[depobj_count + 5];
> +  int i;
> +  int check = omp_target_memcpy_check (dst_device_num, src_device_num,
> +                                    &dst_devicep, &src_devicep);
> +
> +  omp_target_memcpy_data s = {
> +    .dst = dst,
> +    .src = src,
> +    .length = length,
> +    .dst_offset = dst_offset,
> +    .src_offset = src_offset,
> +    .dst_devicep = dst_devicep,
> +    .src_devicep = src_devicep
> +  };
> +  data = &s;

And the above stmt, just pass &s as the second argument.

> +
> +  if (check)
> +    return check;
> +
> +  depend[0] = 0;
> +  depend[1] = (void *) (uintptr_t) depobj_count;
> +  depend[2] = depend[3] = depend[4] = 0;
> +  for (i = 0; i < depobj_count; ++i)
> +    depend[i + 5] = &depobj_list[i];

This doesn't need to be done if flags will not include
GOMP_TASK_FLAG_DEPEND, so maybe better:

> +
> +  if (depobj_count > 0 && depobj_list != NULL)
> +    flags |= GOMP_TASK_FLAG_DEPEND;

add here
  else
    {
      depend[0] = 0;
...
    }

> +
> +  GOMP_task (fn, data, /*cpyfn=*/NULL, /*arg_size=*/0, /*arg_align=*/0,
> +          /*if_clause=*/false, flags, depend, /*priority_arg=*/0,
> +          /*detach=*/NULL);

Ditto for the other function.

        Jakub

Reply via email to