https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87835
--- Comment #3 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
Created attachment 45457
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45457&action=edit
[WIP] libgomp.oacc-c-c++-common/asyncwait-1.c debug

(In reply to Tom de Vries from comment #2)
> (In reply to Tom de Vries from comment #1)
> > (In reply to Thomas Schwinge from comment #0)
> > > After r264397 "[nvptx] Remove use of CUDA unified memory in libgomp", I'm
> > > seeing (intermittently only, and only on some systems):
> >
> > I see the failure reproduced consistently with a Quadro M1200.

Oh, good -- in a way ;-) -- that it's consistently reproducible for you.  For
me, the failure is rather rare.

> > > I have not yet analyzed what's causing this, but I have some ideas about
> > > pending patches that might cure it.

Unfortunately, the patches I've been thinking of either are on trunk already,
or can't possibly be related to this problem.

The 'async'/'wait' clauses/directives in the test case look correct.

> do you intend to address this before stage4 closes?

I'd like to, yes.

Here is my current status.

With "-O2":

    [...]
      nvptx_exec: kernel main$_omp_fn$37: launch gangs=32, workers=1, vectors=32
      nvptx_exec: kernel main$_omp_fn$37: finished
      GOACC_data_end: restore mappings
      GOACC_data_end: mappings restored
    [abort]

In addition to "main$_omp_fn$37", sometimes also seen with "main$_omp_fn$25",
"main$_omp_fn$29", "main$_omp_fn$33".  So far only seen with OpenACC 'kernels'
constructs, but not with the very similar 'parallel' ones earlier in the file.

For example, without "DEBUG_K":

    [...]
      nvptx_exec: kernel main$_omp_fn$37: launch gangs=32, workers=1, vectors=32
      nvptx_exec: kernel main$_omp_fn$37: finished
      GOACC_wait -2 1
      goacc_wait -2 1
      goacc_wait 1
      GOACC_data_end: restore mappings
      GOACC_data_end: mappings restored
    1007 c[64] 0
    1019 e[64] 13
    1007 c[65] 0
    1019 e[65] 13
    1007 c[66] 0
    1019 e[66] 13
    [...]
    1007 c[125] 0
    1019 e[125] 13
    1007 c[126] 0
    1019 e[126] 13
    1007 c[127] 0
    1019 e[127] 13

With "DEBUG_K":

    [...]
      nvptx_exec: kernel main$_omp_fn$37: launch gangs=1, workers=1, vectors=32
      nvptx_exec: kernel main$_omp_fn$37: finished
      GOACC_wait -2 1
      goacc_wait -2 1
      goacc_wait 1
    966 c[64] 0
    966 c[65] 0
    966 c[66] 0
    [...]
    966 c[125] 0
    966 c[126] 0
    966 c[127] 0

So, the compute kernel ("main$_omp_fn$37") doesn't find the "c" array properly
initialized, even though the initialization and the kernel are enqueued on the
same 'async' queue, and so have to execute in order by definition.

I've only ever seen this with the "c" array.  Sometimes the wrong values start
already at index 0 (often seen with "main$_omp_fn$29"), or as late as index 100
(rarely).

Running under "valgrind", repeatedly until there's an "abort", doesn't print
anything suspicious.

Might this perhaps be a latent issue in OpenACC 'kernels' plus 'async', now
uncovered by the r264397 "[nvptx] Remove use of CUDA unified memory in libgomp"
commit?
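
To illustrate the ordering guarantee in question, here is a minimal sketch. It is not an excerpt from asyncwait-1.c: the function names, the array size and the values 3 and 6 are made up for illustration. Two 'kernels' regions are enqueued on the same 'async' queue, so the second one should only ever see "c" fully initialized. Without "-fopenacc" the pragmas are ignored and the loops simply run sequentially on the host.

```c
#include <assert.h>
#include <stdio.h>

#define N 128

/* Hypothetical helper, not taken from asyncwait-1.c: returns how many
   elements of e[] show that the second region read a stale c[].  */
int count_stale_reads(void)
{
    float c[N], e[N];
    int stale = 0;

    for (int i = 0; i < N; i++) {
        c[i] = 0.0f;
        e[i] = 0.0f;
    }

#pragma acc data copy(c[0:N], e[0:N])
    {
        /* First region: initialize c[] on async queue 1.  */
#pragma acc kernels async(1)
        for (int i = 0; i < N; i++)
            c[i] = 3.0f;

        /* Second region: same queue 1, so it is expected to start only
           after the first region has completed.  */
#pragma acc kernels async(1)
        for (int i = 0; i < N; i++)
            e[i] = c[i] * 2.0f;

        /* Drain queue 1 before the data region copies the arrays back.  */
#pragma acc wait(1)
    }

    for (int i = 0; i < N; i++)
        if (e[i] != 6.0f) {
            fprintf(stderr, "e[%d] = %f\n", i, e[i]);
            stale++;
        }

    return stale;
}

/* Fails loudly if the second region saw uninitialized data.  */
void check_same_queue_ordering(void)
{
    assert(count_stale_reads() == 0);
}
```

If the failure described above is indeed an ordering problem, a variant of this sketch would presumably show it only intermittently, so it would have to be run in a loop, as with the real test case.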