Re: Too strict synchronization with the local (host) thread?

Chung-Lin Tang Tue, 11 Dec 2018 05:31:21 -0800

On 2018/12/7 11:56 PM, Thomas Schwinge wrote:

--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
@@ -114,6 +114,7 @@ main (int argc, char **argv)

for (i = 0; i < N; i++)

      {
+      stream = (CUstream) acc_get_cuda_stream (i & 1);
        r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);

What's the motivation for this change?


To place work on both streams 0 and 1.

..., and this change are needed because we're now more strictly
synchronizing with the local (host) thread.

Regarding the case of "libgomp.oacc-c-c++-common/lib-81.c", as currently
present:

     [...]
       for (i = 0; i < N; i++)
         {
           r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, streams[i], kargs, 
0);
           if (r != CUDA_SUCCESS)
             {
               fprintf (stderr, "cuLaunchKernel failed: %d\n", r);
               abort ();
             }
         }

This launches N kernels on N separate async queues/CUDA streams, [0..N).

       acc_wait_all_async (N);

Then, the "acc_wait_all_async (N)" -- in my understanding! -- should
*not*  synchronize with the local (host) thread, but instead just set up
the additional async queue/CUDA stream N to "depend" on [0..N).

       for (i = 0; i <= N; i++)
         {
           if (acc_async_test (i) != 0)
             abort ();
         }

Thus, all [0..N) should then still be "acc_async_test (i) != 0" (still
running).

       acc_wait (N);

Here, the "acc_wait (N)" would synchronize the local (host) thread with
async queue/CUDA stream N and thus recursively with [0..N).

       for (i = 0; i <= N; i++)
         {
           if (acc_async_test (i) != 1)
             abort ();
         }
     [...]

So, then all these async queues/CUDA streams here indeed are
"acc_async_test (i) != 1", thas is, idle.


Now, the more strict synchronization with the local (host) thread is not
wrong in term of correctness, but I suppose it will impact performance of
otherwise asynchronous operations, which now get synchronized too much?

Or, of course, I'm misunderstanding something...


IIRC, we encountered many issues where people misunderstood the meaning of 
"wait+async",
using it as if the local host sync happened, where in our original 
implementation it does not.

Also some areas of the OpenACC spec were vague on whether the local host 
synchronization should
or should not happen; basically, the wording treated as if it was only an 
implementation detail
and didn't matter, and didn't acknowledge that this would be something visible 
to the user.

At the end, IIRC, I decided that adding a local host synchronization is easier 
for all of us,
and took the opportunity of the re-org to make this change.

That said, I didn't notice those tests you listed above were meant to test such 
delicate behavior.

(For avoidance of doubt, I would accept the "async re-work" as is, but we
should eventually clarify this, and restore the behavior we -- apparently
-- had before, where we didn't synchronize so much?  (So, technically,
the "async re-work" would constitute a regression for this kind of
usage?)


It's not hard to restore the old behavior, just a few lines to delete. Although 
as described
above, this change was deliberate.

This might be another issue to raise with the committee. I think I tried on 
this exact issue
a long time ago, but never got answers.

Thanks,
Chung-Lin

Re: Too strict synchronization with the local (host) thread?

Reply via email to