Re: Test-lock hang (not 100% reproducible) on GNU/Linux

Bruno Haible Wed, 04 Jan 2017 06:17:51 -0800

Pádraig Brady:
> Now that test-lock.c is relatively fast on numa/multicore systems,
> it seems like it would be useful to first alarm(30) or something
> to protect against infinite hangs?


If we could not pinpoint the origin of the problem, I agree, an alarm(30)
would be the right thing to prevent an infinite hang.

But by now, we know

1) It's a glibc bug: The test [6] fails even after it has set the
   policies that POSIX expects for the "writers get the rwlock in preference
   to readers guarantee".

2) Without this guarantee, a reader function that repeatedly spends
     I milliseconds in a section protected by the rwlock,
     O milliseconds without the rwlock being held,
   in a system with N reader threads in parallel
   will lead to
     - a successful termination if   N * I / (I + O) < 1.0
     - an infinite hang if           N * I / (I + O) > 1.0
   (There is actually no discontinuity at 1.0; need to use probability
   calculus for a more detailed analysis.)
   So, in order to make test_rwlock hang-tree, I would need to introduce
   a sleep() without the rwlock being held, and the duration of this sleep
   would be at least (N - 1) * I.

   Now, asking an application writer to add sleep()s in his code, with
   a duration that depends both on the number of threads and on the time
   spent in specific portions of the code, is outrageous.

   So, as it stands, POSIX rwlock without a "writers get preference" guarantee
   is unusable.

I propose to do what we usually do in gnulib, to work around bugs and unusable
APIs:
  - Write a configure test for the guarantee, based on [6].
  - Modify the 'lock' module to use its own implementation of rwlock.
  - Add a unit test to verify the guarantee (so that we can also detect
    if the same problem occurs in pth or Solaris), again based on [6].

Patch in preparation...

Bruno

[6] 
https://github.com/linux-test-project/ltp/blob/master/testcases/open_posix_testsuite/conformance/interfaces/pthread_rwlock_rdlock/2-2.c

diff --git a/lib/glthread/lock.c b/lib/glthread/lock.c
index 6760bbd..215991e 100644
--- a/lib/glthread/lock.c
+++ b/lib/glthread/lock.c
@@ -30,7 +30,7 @@
 
 /* ------------------------- gl_rwlock_t datatype ------------------------- */
 
-# if HAVE_PTHREAD_RWLOCK
+# if HAVE_PTHREAD_RWLOCK && !(__GNU_LIBRARY__ > 1)
 
 #  if !defined PTHREAD_RWLOCK_INITIALIZER
 
@@ -152,11 +152,9 @@ glthread_rwlock_rdlock_multithreaded (gl_rwlock_t *lock)
   if (err != 0)
     return err;
   /* Test whether only readers are currently running, and whether the runcount
-     field will not overflow.  */
-  /* POSIX says: "It is implementation-defined whether the calling thread
-     acquires the lock when a writer does not hold the lock and there are
-     writers blocked on the lock."  Let's say, no: give the writers a higher
-     priority.  */
+     field will not overflow, and whether no writer is waiting.  The latter
+     condition is because POSIX recommends that "write locks shall take
+     precedence over read locks", to avoid "writer starvation".  */
   while (!(lock->runcount + 1 > 0 && lock->waiting_writers_count == 0))
     {
       /* This thread has to wait for a while.  Enqueue it among the
@@ -796,8 +794,10 @@ glthread_rwlock_rdlock_func (gl_rwlock_t *lock)
     }
   EnterCriticalSection (&lock->lock);
   /* Test whether only readers are currently running, and whether the runcount
-     field will not overflow.  */
-  if (!(lock->runcount + 1 > 0))
+     field will not overflow, and whether no writer is waiting.  The latter
+     condition is because POSIX recommends that "write locks shall take
+     precedence over read locks", to avoid "writer starvation".  */
+  if (!(lock->runcount + 1 > 0 && lock->waiting_writers.count == 0))
     {
       /* This thread has to wait for a while.  Enqueue it among the
          waiting_readers.  */
diff --git a/lib/glthread/lock.h b/lib/glthread/lock.h
index 3ea98d6..c808d0e 100644
--- a/lib/glthread/lock.h
+++ b/lib/glthread/lock.h
@@ -176,7 +176,7 @@ typedef pthread_mutex_t gl_lock_t;
 
 /* ------------------------- gl_rwlock_t datatype ------------------------- */
 
-# if HAVE_PTHREAD_RWLOCK
+# if HAVE_PTHREAD_RWLOCK && !(__GNU_LIBRARY__ > 1)
 
 #  ifdef PTHREAD_RWLOCK_INITIALIZER

Re: Test-lock hang (not 100% reproducible) on GNU/Linux

Reply via email to