Hi! On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin <iver...@gmail.com> wrote: > On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote: > > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote: > > > +/* This function finalizes all initialized devices. */ > > > + > > > +static void > > > +gomp_target_fini (void) > > > +{ > > > + [...] > > > > The question is what will this do if there are async target tasks still > > running on some of the devices at this point (forgotten #pragma omp taskwait > > or similar if target nowait regions are started outside of parallel region, > > or exit inside of parallel, etc. But perhaps it can be handled > > incrementally. > > Also there is the question that the > > So I think the patch is ok with the above mentioned changes. > > Here is what I've committed to trunk.
> --- a/libgomp/libgomp.h > +++ b/libgomp/libgomp.h > @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t > } cuda; > } acc_dispatch_t; > > +/* Various state of the accelerator device. */ > +enum gomp_device_state > +{ > + GOMP_DEVICE_UNINITIALIZED, > + GOMP_DEVICE_INITIALIZED, > + GOMP_DEVICE_FINALIZED > +}; > + > /* This structure describes accelerator device. > It contains name of the corresponding libgomp plugin, function handlers > for > interaction with the device, ID-number of the device, and information > about > @@ -933,8 +941,10 @@ struct gomp_device_descr > /* Mutex for the mutable data. */ > gomp_mutex_t lock; > > - /* Set to true when device is initialized. */ > - bool is_initialized; > + /* Current state of the device. OpenACC allows to move from INITIALIZED > state > + back to UNINITIALIZED state. OpenMP allows only to move from > INITIALIZED > + to FINALIZED state (at program shutdown). */ > + enum gomp_device_state state; (ACK, but I assume we'll want to make sure that an OpenACC device is never re-initialized if we're in/after the libgomp finalization phase.) The issue mentioned above: "exit inside of parallel" is actually a problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c, libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90 test cases now run into annoying "WARNING: program timed out". Here is what's happening, as I understand it: in libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal. > --- a/libgomp/target.c > +++ b/libgomp/target.c > +/* This function finalizes all initialized devices. */ > + > +static void > +gomp_target_fini (void) > +{ > + int i; > + for (i = 0; i < num_devices; i++) > + { > + struct gomp_device_descr *devicep = &devices[i]; > + gomp_mutex_lock (&devicep->lock); > + if (devicep->state == GOMP_DEVICE_INITIALIZED) > + { > + devicep->fini_device_func (devicep->target_id); > + devicep->state = GOMP_DEVICE_FINALIZED; > + } > + gomp_mutex_unlock (&devicep->lock); > + } > +} > @@ -2387,6 +2433,9 @@ gomp_target_init (void) > if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200) > goacc_register (&devices[i]); > } > + > + if (atexit (gomp_target_fini) != 0) > + gomp_fatal ("atexit failed"); > } Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the atexit handler, gomp_target_fini, which, with the device lock held, will call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to clean up. Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA context is now in an inconsistent state, see <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html>: CUDA_ERROR_LAUNCH_FAILED = 719 An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA. Thus, any cuMemFreeHost invocations that are run during clean-up will now also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call GOMP_PLUGIN_fatal, which again will trigger the same or another (GOMP_offload_unregister_ver) atexit handler, which will then deadlock trying to lock the device again, which is still locked. (Jim, I wonder: after the first CUDA_ERROR_LAUNCH_FAILED and similar errors, should we destroy the context right away, or toggle a boolean flag to mark it as unusable, and use that as an indication to avoid the follow-on failures of cuMemFreeHost just described above, for example?) <http://pubs.opengroup.org/onlinepubs/9699919799/functions/atexit.html> tells us: Since the behavior is undefined if the exit() function is called more than once, portable applications calling atexit() must ensure that the exit() function is not called at normal process termination when all functions registered by the atexit() function are called. ... which we're violating here, at least in the nvptx plugin. I have not analyzed the intermic one. As it happens, Chung-Lin has been working in that area: <http://news.gmane.org/find-root.php?message_id=%3C55DF1452.9050501%40codesourcery.com%3E>, which he recently re-posted: <http://news.gmane.org/find-root.php?message_id=%3C566EE49A.3050403%40codesourcery.com%3E>, <http://news.gmane.org/find-root.php?message_id=%3C566EC310.8000403%40codesourcery.com%3E>, <http://news.gmane.org/find-root.php?message_id=%3C566EC324.9050505%40codesourcery.com%3E>. I have not analyzed whether his changes would completely resolve the problem just described, but at least conceptually they seem to be a step into the right direction? (Jakub?) Now, to resolve the immediate problem, what is the right thing for us to do? Is the following simple change OK, or is there a reason to still run atexit handlers if terminating under error conditions? commit b1733e8f9df6ae7d6828e2194df1b314772701c5 Author: Thomas Schwinge <tho...@codesourcery.com> Date: Wed Dec 16 13:10:39 2015 +0100 Avoid deadlocks in libgomp due to competing atexit handlers libgomp/ * error.c (gomp_vfatal): Call _exit instead of exit. --- libgomp/error.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git libgomp/error.c libgomp/error.c index 094c24a..1ef7491 100644 --- libgomp/error.c +++ libgomp/error.c @@ -34,6 +34,7 @@ #include <stdarg.h> #include <stdio.h> #include <stdlib.h> +#include <unistd.h> #undef gomp_vdebug @@ -77,7 +78,7 @@ void gomp_vfatal (const char *fmt, va_list list) { gomp_verror (fmt, list); - exit (EXIT_FAILURE); + _exit (EXIT_FAILURE); } void Grüße Thomas
signature.asc
Description: PGP signature