On 2018/12/7 11:56 PM, Thomas Schwinge wrote:
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
@@ -114,6 +114,7 @@ main (int argc, char **argv)
for (i = 0; i < N; i++)
{
+ stream = (CUstream) acc_get_cuda_stream (i & 1);
r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);
What's the motivation for this change?
To place work on both streams 0 and 1.
..., and this change are needed because we're now more strictly
synchronizing with the local (host) thread.
Regarding the case of "libgomp.oacc-c-c++-common/lib-81.c", as currently
present:
[...]
for (i = 0; i < N; i++)
{
r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, streams[i], kargs,
0);
if (r != CUDA_SUCCESS)
{
fprintf (stderr, "cuLaunchKernel failed: %d\n", r);
abort ();
}
}
This launches N kernels on N separate async queues/CUDA streams, [0..N).
acc_wait_all_async (N);
Then, the "acc_wait_all_async (N)" -- in my understanding! -- should
*not* synchronize with the local (host) thread, but instead just set up
the additional async queue/CUDA stream N to "depend" on [0..N).
for (i = 0; i <= N; i++)
{
if (acc_async_test (i) != 0)
abort ();
}
Thus, all [0..N) should then still be "acc_async_test (i) != 0" (still
running).
acc_wait (N);
Here, the "acc_wait (N)" would synchronize the local (host) thread with
async queue/CUDA stream N and thus recursively with [0..N).
for (i = 0; i <= N; i++)
{
if (acc_async_test (i) != 1)
abort ();
}
[...]
So, then all these async queues/CUDA streams here indeed are
"acc_async_test (i) != 1", thas is, idle.
Now, the more strict synchronization with the local (host) thread is not
wrong in term of correctness, but I suppose it will impact performance of
otherwise asynchronous operations, which now get synchronized too much?
Or, of course, I'm misunderstanding something...
IIRC, we encountered many issues where people misunderstood the meaning of
"wait+async",
using it as if the local host sync happened, where in our original
implementation it does not.
Also some areas of the OpenACC spec were vague on whether the local host
synchronization should
or should not happen; basically, the wording treated as if it was only an
implementation detail
and didn't matter, and didn't acknowledge that this would be something visible
to the user.
At the end, IIRC, I decided that adding a local host synchronization is easier
for all of us,
and took the opportunity of the re-org to make this change.
That said, I didn't notice those tests you listed above were meant to test such
delicate behavior.
(For avoidance of doubt, I would accept the "async re-work" as is, but we
should eventually clarify this, and restore the behavior we -- apparently
-- had before, where we didn't synchronize so much? (So, technically,
the "async re-work" would constitute a regression for this kind of
usage?)
It's not hard to restore the old behavior, just a few lines to delete. Although
as described
above, this change was deliberate.
This might be another issue to raise with the committee. I think I tried on
this exact issue
a long time ago, but never got answers.
Thanks,
Chung-Lin