Hi!

Ping.

On Wed, 16 Dec 2015 13:30:21 +0100, I wrote:
> On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin <iver...@gmail.com> wrote:
> > On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote:
> > > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote:
> > > > +/* This function finalizes all initialized devices.  */
> > > > +
> > > > +static void
> > > > +gomp_target_fini (void)
> > > > +{
> > > > +  [...]
> > > 
> > > The question is what will this do if there are async target tasks still
> > > running on some of the devices at this point (forgotten #pragma omp 
> > > taskwait
> > > or similar if target nowait regions are started outside of parallel 
> > > region,
> > > or exit inside of parallel, etc.  But perhaps it can be handled 
> > > incrementally.
> > > Also there is the question that the 
> > > So I think the patch is ok with the above mentioned changes.
> > 
> > Here is what I've committed to trunk.
> 
> > --- a/libgomp/libgomp.h
> > +++ b/libgomp/libgomp.h
> > @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t
> >    } cuda;
> >  } acc_dispatch_t;
> >  
> > +/* Various state of the accelerator device.  */
> > +enum gomp_device_state
> > +{
> > +  GOMP_DEVICE_UNINITIALIZED,
> > +  GOMP_DEVICE_INITIALIZED,
> > +  GOMP_DEVICE_FINALIZED
> > +};
> > +
> >  /* This structure describes accelerator device.
> >     It contains name of the corresponding libgomp plugin, function handlers 
> > for
> >     interaction with the device, ID-number of the device, and information 
> > about
> > @@ -933,8 +941,10 @@ struct gomp_device_descr
> >    /* Mutex for the mutable data.  */
> >    gomp_mutex_t lock;
> >  
> > -  /* Set to true when device is initialized.  */
> > -  bool is_initialized;
> > +  /* Current state of the device.  OpenACC allows to move from INITIALIZED 
> > state
> > +     back to UNINITIALIZED state.  OpenMP allows only to move from 
> > INITIALIZED
> > +     to FINALIZED state (at program shutdown).  */
> > +  enum gomp_device_state state;
> 
> (ACK, but I assume we'll want to make sure that an OpenACC device is
> never re-initialized if we're in/after the libgomp finalization phase.)
> 
> 
> The issue mentioned above: "exit inside of parallel" is actually a
> problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c,
> libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90
> test cases now run into annoying "WARNING: program timed out".  Here is
> what's happening, as I understand it: in
> libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call
> returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal.
> 
> > --- a/libgomp/target.c
> > +++ b/libgomp/target.c
> 
> > +/* This function finalizes all initialized devices.  */
> > +
> > +static void
> > +gomp_target_fini (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < num_devices; i++)
> > +    {
> > +      struct gomp_device_descr *devicep = &devices[i];
> > +      gomp_mutex_lock (&devicep->lock);
> > +      if (devicep->state == GOMP_DEVICE_INITIALIZED)
> > +   {
> > +     devicep->fini_device_func (devicep->target_id);
> > +     devicep->state = GOMP_DEVICE_FINALIZED;
> > +   }
> > +      gomp_mutex_unlock (&devicep->lock);
> > +    }
> > +}
> 
> > @@ -2387,6 +2433,9 @@ gomp_target_init (void)
> >        if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
> >     goacc_register (&devices[i]);
> >      }
> > +
> > +  if (atexit (gomp_target_fini) != 0)
> > +    gomp_fatal ("atexit failed");
> >  }
> 
> Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> atexit handler, gomp_target_fini, which, with the device lock held, will
> call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> clean up.
> 
> Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> context is now in an inconsistent state, see
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html>:
> 
>     CUDA_ERROR_LAUNCH_FAILED = 719
>         An exception occurred on the device while executing a
>         kernel. Common causes include dereferencing an invalid device
>         pointer and accessing out of bounds shared memory. The context
>         cannot be used, so it must be destroyed (and a new one should be
>         created). All existing device memory allocations from this
>         context are invalid and must be reconstructed if the program is
>         to continue using CUDA.
> 
> Thus, any cuMemFreeHost invocations that are run during clean-up will now
> also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> GOMP_PLUGIN_fatal, which again will trigger the same or another
> (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> trying to lock the device again, which is still locked.
> 
> (Jim, I wonder: after the first CUDA_ERROR_LAUNCH_FAILED and similar
> errors, should we destroy the context right away, or toggle a boolean
> flag to mark it as unusable, and use that as an indication to avoid the
> follow-on failures of cuMemFreeHost just described above, for example?)
> 
> <http://pubs.opengroup.org/onlinepubs/9699919799/functions/atexit.html>
> tells us:
> 
>     Since the behavior is undefined if the exit() function is called more
>     than once, portable applications calling atexit() must ensure that the
>     exit() function is not called at normal process termination when all
>     functions registered by the atexit() function are called.
> 
> ... which we're violating here, at least in the nvptx plugin.  I have not
> analyzed the intermic one.
> 
> As it happens, Chung-Lin has been working in that area:
> <http://news.gmane.org/find-root.php?message_id=%3C55DF1452.9050501%40codesourcery.com%3E>,
> which he recently re-posted:
> <http://news.gmane.org/find-root.php?message_id=%3C566EE49A.3050403%40codesourcery.com%3E>,
> <http://news.gmane.org/find-root.php?message_id=%3C566EC310.8000403%40codesourcery.com%3E>,
> <http://news.gmane.org/find-root.php?message_id=%3C566EC324.9050505%40codesourcery.com%3E>.
> I have not analyzed whether his changes would completely resolve the
> problem just described, but at least conceptually they seem to be a step
> into the right direction?  (Jakub?)
> 
> Now, to resolve the immediate problem, what is the right thing for us to
> do?  Is the following simple change OK, or is there a reason to still run
> atexit handlers if terminating under error conditions?
> 
> commit b1733e8f9df6ae7d6828e2194df1b314772701c5
> Author: Thomas Schwinge <tho...@codesourcery.com>
> Date:   Wed Dec 16 13:10:39 2015 +0100
> 
>     Avoid deadlocks in libgomp due to competing atexit handlers
>     
>       libgomp/
>       * error.c (gomp_vfatal): Call _exit instead of exit.
> ---
>  libgomp/error.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git libgomp/error.c libgomp/error.c
> index 094c24a..1ef7491 100644
> --- libgomp/error.c
> +++ libgomp/error.c
> @@ -34,6 +34,7 @@
>  #include <stdarg.h>
>  #include <stdio.h>
>  #include <stdlib.h>
> +#include <unistd.h>
>  
>  
>  #undef gomp_vdebug
> @@ -77,7 +78,7 @@ void
>  gomp_vfatal (const char *fmt, va_list list)
>  {
>    gomp_verror (fmt, list);
> -  exit (EXIT_FAILURE);
> +  _exit (EXIT_FAILURE);
>  }
>  
>  void


Grüße
 Thomas

Attachment: signature.asc
Description: PGP signature

Reply via email to