Stabilize flaky GCN target/offloading testing

Thomas Schwinge Wed, 06 Mar 2024 04:10:03 -0800

Hi!

On 2024-02-21T17:32:13+0100, Richard Biener <rguent...@suse.de> wrote:
> On 21.02.2024 at 13:34, Thomas Schwinge <tschwi...@baylibre.com> wrote:
>> [...] per my work on <https://gcc.gnu.org/PR66005>
>> "libgomp make check time is excessive", all execution testing in libgomp
>> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  [...]
>> (... with the caveat that execution tests for
>> effective-targets are *not* governed by that, as I've found yesterday.
>> I have a WIP hack for that, too.)


>> What disturbs the testing a lot is that the GPU may get into a bad
>> state, upon which any use either fails with a
>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
>> 'libhsa-runtime64.so.1'...
>> 
>> I've now tried to debug the latter case (hang).  When the GPU gets into
>> this bad state (whatever exactly that is),
>> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
>> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
>> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
>> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
>> There it hangs until killed (for example, until DejaGnu's timeout
>> mechanism kills the process -- just that the next GPU-using execution
>> test then runs into the same thing again...).
>> 
>> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
>> we're able to recover via:
>> 
>>    $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
>>    0

At least most of the time.  I've found that -- sometimes... ;-( -- if
you run into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', then do
'amdgpu_gpu_recover', and then immediately re-execute, you'll again run
into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.  That appears to be avoidable
by injecting some artificial "cool-down period"...  (The latter I've not
yet tested extensively.)
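
A minimal sketch of that recover/cool-down/re-execute loop, in C.  This is
only an illustration, not the DejaGnu implementation: 'struct gpu_hooks',
'run_with_recovery' and the 'fake_*' stand-ins are hypothetical names.  A
real 'recover' hook would trigger 'amdgpu_gpu_recover', and a suitable
cool-down value is an open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical hooks, not part of GCC or DejaGnu: 'run' executes the test
   once and returns its captured output, 'recover' resets the GPU.  */
struct gpu_hooks
{
  const char *(*run) (void *state);
  void (*recover) (void *state);
  void *state;
};

/* Re-execute until the output no longer reports the resource error.
   After every failure: recover the GPU, then wait 'cooldown_s' seconds.
   Returns true if one of the attempts ran without the error.  */
static bool
run_with_recovery (const struct gpu_hooks *hooks, int max_attempts,
                   unsigned int cooldown_s)
{
  assert (max_attempts > 0);
  for (int attempt = 0; attempt < max_attempts; ++attempt)
    {
      const char *output = hooks->run (hooks->state);
      if (strstr (output, "HSA_STATUS_ERROR_OUT_OF_RESOURCES") == NULL)
        return true;
      hooks->recover (hooks->state);
      sleep (cooldown_s);
    }
  return false;
}

/* Stand-in for a real GPU, for trying out the loop: reports the error a
   fixed number of times, then passes.  */
struct fake_gpu
{
  int failures_left;
  int recoveries;
};

static const char *
fake_run (void *state)
{
  struct fake_gpu *gpu = state;
  if (gpu->failures_left > 0)
    {
      gpu->failures_left--;
      return "HSA_STATUS_ERROR_OUT_OF_RESOURCES";
    }
  return "PASS";
}

static void
fake_recover (void *state)
{
  struct fake_gpu *gpu = state;
  gpu->recoveries++;
}
```

With the stand-ins, a 'struct fake_gpu' that fails twice needs three
executions and two recoveries before 'run_with_recovery' returns true.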

>> This is, obviously, a hack, probably needs a serial lock to not disturb
>> other things, has hard-coded 'dri/0', and as I said in
>> <https://inbox.sourceware.org/87plww8qin....@euler.schwinge.ddns.net>
>> "GCN RDNA2+ vs. GCC SLP vectorizer":
>> 
>> | I've no idea what
>> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.
>
> It ends up terminating your X session…

Eh....  ;'-|

> (there’s some automatic driver recovery that’s also sometimes triggered which 
> sounds like the same thing).

> I need to try using the integrated graphics for X11 to see if that avoids the 
> issue.

A few years ago, I tried that for an Nvidia GPU laptop, and -- if I now
remember correctly -- basically got it to work, via hand-editing
'/etc/X11/xorg.conf' and all that...  But: I couldn't get external HDMI
to work in that setup, and therefore reverted to "standard".

> Guess AMD needs to improve the driver/runtime (or we - it’s open source at 
> least up to the firmware).

>> However, it's very useful in my testing.  :-|
>> 
>> The question is, how to detect the "hang" state without first running
>> into a timeout (and disambiguating such a timeout from a user code
>> timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
>> initialization, and before the actual GPU kernel launch cancel it with
>> 'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
>> error message that we can then react on, like for
>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
>> no-go in libgomp -- instead, use a helper thread to similarly implement a
>> watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
>> other purposes.)  Any other clever ideas?  What's a suitable value for
>> "a few seconds"?

I'm attaching my current "GCN: Watchdog for device image load", covering
both 'gcc/config/gcn/gcn-run.cc' and 'libgomp/plugin/plugin-gcn.c'.
(That's using 'timer_create' etc. instead of 'alarm'/'SIGALRM'.)

That, plus routing *all* potential GPU usage (in particular: including
execution tests for effective-targets, see above) through a serial lock
('flock', implemented in DejaGnu board file, outside of the
"DejaGnu timeout domain", similar to
'libgomp/testsuite/lib/libgomp.exp:libgomp_load', see above), plus
catching 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (both the "real" ones and
the "fake" ones via "GCN: Watchdog for device image load") and in that
case 'amdgpu_gpu_recover' and re-execution of the respective executable,
does greatly stabilize flaky GCN target/offloading testing.
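
The serial-lock part can be sketched in C with flock(2), the system call
behind the 'flock' command.  The actual lock lives in a DejaGnu board file,
which is not shown here; 'with_gpu_lock' and 'lock_is_held' are hypothetical
helpers for illustration only, and the lock path is whatever the caller
passes in.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Run 'fn (arg)' while holding an exclusive lock on 'lock_path', blocking
   until the lock is available.  Returns the result of 'fn', or -1 if the
   lock file could not be opened or locked.  */
static int
with_gpu_lock (const char *lock_path, int (*fn) (void *), void *arg)
{
  assert (fn != NULL);
  int fd = open (lock_path, O_RDWR | O_CREAT, 0666);
  if (fd < 0)
    return -1;
  /* Retry if a signal interrupts the wait for the lock.  */
  while (flock (fd, LOCK_EX) != 0)
    if (errno != EINTR)
      {
        close (fd);
        return -1;
      }
  int res = fn (arg);
  /* Closing the only descriptor for this open also drops the lock.  */
  close (fd);
  return res;
}

/* Demo callback: returns 1 if somebody currently holds the lock on the
   path passed as 'arg', 0 if it is free, -1 on error.  */
static int
lock_is_held (void *arg)
{
  const char *lock_path = arg;
  int fd = open (lock_path, O_RDWR | O_CREAT, 0666);
  if (fd < 0)
    return -1;
  int held = (flock (fd, LOCK_EX | LOCK_NB) != 0 && errno == EWOULDBLOCK);
  close (fd);
  return held;
}
```

Because 'flock' locks belong to the open file description, a second 'open'
of the same path conflicts with the first one even within one process; that
is what 'lock_is_held' relies on.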

Do we have consensus to move forward with this approach, generally?


Regards
 Thomas

From 21795353483c263c91a5efa80da41a75a6b2b629 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <tschwi...@baylibre.com>
Date: Thu, 22 Feb 2024 21:50:45 +0100
Subject: [PATCH] GCN: Watchdog for device image load

---
 gcc/config/gcn/gcn-run.cc   | 76 ++++++++++++++++++++++++++++++++++
 libgomp/plugin/plugin-gcn.c | 81 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 156 insertions(+), 1 deletion(-)

diff --git a/gcc/config/gcn/gcn-run.cc b/gcc/config/gcn/gcn-run.cc
index d45ff3e6c2ba..ab15185af471 100644
--- a/gcc/config/gcn/gcn-run.cc
+++ b/gcc/config/gcn/gcn-run.cc
@@ -33,6 +33,8 @@
 #include <unistd.h>
 #include <elf.h>
 #include <signal.h>
+#include <time.h>
+#include <errno.h>
 
 #include "hsa.h"
 #include "../../../libgomp/config/gcn/libgomp-gcn.h"
@@ -616,6 +618,70 @@ run (uint64_t kernel, void *kernargs)
 	"Clean up signal");
 }
 
+/* Watchdog.  */
+
+static void
+watchdog_bark (union sigval sigev_value)
+{
+  const char *msg = sigev_value.sival_ptr;
+  fprintf (stderr, "Watchdog barking %s\n", msg);
+  exit (EXIT_FAILURE);
+}
+
+static void
+watchdog_start (timer_t *restrict timeridp, const int s, const char *msg)
+{
+  if (debug)
+    fprintf (stderr, "Starting watchdog\n");
+
+  struct sigevent sev;
+  sev.sigev_notify = SIGEV_THREAD;
+  sev.sigev_value.sival_ptr = (void *) (uintptr_t) msg;
+  sev.sigev_notify_function = watchdog_bark;
+  sev.sigev_notify_attributes = NULL;
+  int res;
+  /* Backoff in case of 'EAGAIN': waiting 255..534773760 ns in 22 attempts.  */
+  int32_t wait_ns = 255;
+  while ((res = timer_create (CLOCK_MONOTONIC, &sev, timeridp)) != 0
+	 && errno == EAGAIN && wait_ns <= 999999999)
+    {
+      if (debug)
+	fprintf (stderr, "'timer_create': 'EAGAIN'; waiting %d ns\n",
+		 (int) wait_ns);
+      struct timespec wait_ts = { 0, wait_ns };
+      (void) nanosleep (&wait_ts, NULL);
+      wait_ns <<= 1;
+    }
+  if (res != 0)
+    {
+      perror ("'timer_create' FAILED");
+      exit (EXIT_FAILURE);
+    }
+
+  struct itimerspec its = { { 0, 0 }, { s, 0 } };
+  res = timer_settime (*timeridp, 0, &its, NULL);
+  if (res != 0)
+    {
+      perror ("'timer_settime' FAILED");
+      exit (EXIT_FAILURE);
+    }
+}
+
+static void
+watchdog_stop (timer_t timerid)
+{
+  int res;
+  res = timer_delete (timerid);
+  if (res != 0)
+    {
+      perror ("'timer_delete' FAILED");
+      exit (EXIT_FAILURE);
+    }
+
+  if (debug)
+    fprintf (stderr, "Stopped watchdog\n");
+}
+
 int
 main (int argc, char *argv[])
 {
@@ -658,7 +724,17 @@ main (int argc, char *argv[])
   char **kernel_argv = &argv[kernel_arg];
 
   init_device ();
+
+  /* Something's wrong if the device image load doesn't complete quickly;
+     <https://inbox.sourceware.org/87il2ij8sm....@euler.schwinge.ddns.net>
+     "Stabilizing flaky libgomp GCN target/offloading testing".  */
+  timer_t watchdog;
+  static const int watchdog_s = 10;
+  watchdog_start (&watchdog, watchdog_s,
+		  "during device image load; maybe handle similar to"
+		  " 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'?");
   load_image (kernel_argv[0]);
+  watchdog_stop (watchdog);
 
   /* Calculate size of function parameters + argv data.  */
   size_t args_size = 0;
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 2771123252a8..5680d9f5a34a 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -48,6 +48,8 @@
 #include "oacc-plugin.h"
 #include "oacc-int.h"
 #include <assert.h>
+#include <time.h>
+#include <errno.h>
 
 /* These probably won't be in elf.h for a while.  */
 #ifndef R_AMDGPU_NONE
@@ -1371,6 +1373,71 @@ hsa_queue_callback (hsa_status_t status,
   hsa_fatal ("Asynchronous queue error", status);
 }
 
+/* }}}  */
+/* {{{ Watchdog  */
+
+static void
+watchdog_bark (union sigval sigev_value)
+{
+  const char *msg = sigev_value.sival_ptr;
+  GOMP_PLUGIN_error ("GCN fatal error: watchdog barking %s\n", msg);
+  _Exit (EXIT_FAILURE);
+}
+
+static void
+watchdog_start (timer_t *restrict timeridp, const int s, const char *msg)
+{
+  GCN_DEBUG ("Starting watchdog\n");
+
+  struct sigevent sev;
+  sev.sigev_notify = SIGEV_THREAD;
+  sev.sigev_value.sival_ptr = (void *) (uintptr_t) msg;
+  sev.sigev_notify_function = watchdog_bark;
+  sev.sigev_notify_attributes = NULL;
+  int res;
+  /* Backoff in case of 'EAGAIN': waiting 255..534773760 ns in 22 attempts.  */
+  int32_t wait_ns = 255;
+  while ((res = timer_create (CLOCK_MONOTONIC, &sev, timeridp)) != 0
+	 && errno == EAGAIN && wait_ns <= 999999999)
+    {
+      GCN_DEBUG ("'timer_create': 'EAGAIN'; waiting %d ns\n",
+		 (int) wait_ns);
+      struct timespec wait_ts = { 0, wait_ns };
+      (void) nanosleep (&wait_ts, NULL);
+      wait_ns <<= 1;
+    }
+  if (res != 0)
+    {
+      GOMP_PLUGIN_error ("GCN fatal error: 'timer_create' FAILED: %s",
+			 strerror (errno));
+      _Exit (EXIT_FAILURE);
+    }
+
+  struct itimerspec its = { { 0, 0 }, { s, 0 } };
+  res = timer_settime (*timeridp, 0, &its, NULL);
+  if (res != 0)
+    {
+      GOMP_PLUGIN_error ("GCN fatal error: 'timer_settime' FAILED: %s",
+			 strerror (errno));
+      _Exit (EXIT_FAILURE);
+    }
+}
+
+static void
+watchdog_stop (timer_t timerid)
+{
+  int res;
+  res = timer_delete (timerid);
+  if (res != 0)
+    {
+      GOMP_PLUGIN_error ("GCN fatal error: 'timer_delete' FAILED: %s",
+			 strerror (errno));
+      _Exit (EXIT_FAILURE);
+    }
+
+  GCN_DEBUG ("Stopped watchdog\n");
+}
+
 /* }}}  */
 /* {{{ HSA initialization  */
 
@@ -2502,7 +2569,16 @@ create_and_finalize_hsa_program (struct agent_info *agent)
       return false;
     }
   if (agent->prog_finalized)
-    goto final;
+    goto unlock;
+
+  /* Something's wrong if the device image load doesn't complete quickly;
+     <https://inbox.sourceware.org/87il2ij8sm....@euler.schwinge.ddns.net>
+     "Stabilizing flaky libgomp GCN target/offloading testing".  */
+  timer_t watchdog;
+  static const int watchdog_s = 10;
+  watchdog_start (&watchdog, watchdog_s,
+		  "during device image load; maybe handle similar to"
+		  " 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'?");
 
   status
     = hsa_fns.hsa_executable_create_fn (HSA_PROFILE_FULL,
@@ -2581,6 +2657,9 @@ create_and_finalize_hsa_program (struct agent_info *agent)
 final:
   agent->prog_finalized = true;
 
+  watchdog_stop (watchdog);
+
+unlock:
   if (pthread_mutex_unlock (&agent->prog_mutex))
     {
       GOMP_PLUGIN_error ("Could not unlock a GCN agent program mutex");
-- 
2.43.0
