Re: Test-lock hang (not 100% reproducible) on GNU/Linux

Pavel Raiskup Wed, 04 Jan 2017 06:49:49 -0800

On Wednesday, January 4, 2017 3:17:01 PM CET Bruno Haible wrote:
> Pádraig Brady:
> > Now that test-lock.c is relatively fast on numa/multicore systems,
> > it seems like it would be useful to first alarm(30) or something
> > to protect against infinite hangs?
> 
> If we could not pinpoint the origin of the problem, I agree, an alarm(30)
> would be the right thing to prevent an infinite hang.
> 
> But by now, we know
> 
> 1) It's a glibc bug: The test [6] fails even after it has set the
>    policies that POSIX expects for the "writers get the rwlock in preference
>    to readers guarantee".
> 
> 2) Without this guarantee, a reader function that repeatedly spends
>      I milliseconds in a section protected by the rwlock,
>      O milliseconds without the rwlock being held,
>    in a system with N reader threads in parallel
>    will lead to
>      - a successful termination if   N * I / (I + O) < 1.0
>      - an infinite hang if           N * I / (I + O) > 1.0
>    (There is actually no discontinuity at 1.0; need to use probability
>    calculus for a more detailed analysis.)
>    So, in order to make test_rwlock hang-free, I would need to introduce
>    a sleep() without the rwlock being held, and the duration of this sleep
>    would be at least (N - 1) * I.
> 
>    Now, asking an application writer to add sleep()s in his code, with
>    a duration that depends both on the number of threads and on the time
>    spent in specific portions of the code, is outrageous.
> 
>    So, as it stands, POSIX rwlock without a "writers get preference" guarantee
>    is unusable.

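As a worked illustration of the inequality quoted above, here is a minimal C sketch. The function names reader_load() and min_unlocked_ms() are hypothetical, not part of test-lock.c; they only restate N * I / (I + O) and the (N - 1) * I bound from the quoted text, ignoring the probabilistic effects Bruno mentions.

```c
#include <stdio.h>

/* Hypothetical helper: fraction of time the rwlock is wanted by readers,
   N * I / (I + O), with I and O in milliseconds.  */
double reader_load(int n, double i_ms, double o_ms)
{
  return n * i_ms / (i_ms + o_ms);
}

/* Hypothetical helper: the unlocked time O at which the load reaches 1.0,
   obtained by solving N * I / (I + O) = 1 for O, i.e. (N - 1) * I.  */
double min_unlocked_ms(int n, double i_ms)
{
  return (n - 1) * i_ms;
}

/* Print the load and the bound for one configuration.  */
void report(int n, double i_ms, double o_ms)
{
  printf("N=%d I=%.1f O=%.1f load=%.2f bound=%.1f\n",
         n, i_ms, o_ms, reader_load(n, i_ms, o_ms), min_unlocked_ms(n, i_ms));
}
```

For example, with N = 4 readers and I = O = 1 ms the load is 2.0, so by the quoted criterion a writer would never get in; O has to exceed 3 ms before the load drops below 1.0.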

If we don't play with probability a bit longer, I'm still afraid this just
moves the problem somewhere else ... because if writers had preference,
and they were able to hold the rwlock all the time, the readers would starve.

I agree that gl_pthread_rwlock* should match the specification, but at
least in the actual algorithm in test_rwlock() we should make sure that
some readers are actually doing something _during_ the writers' typhoon...
(this is no longer about a hang of test_rwlock(), but we certainly want
the test to exercise something).

Pavel


> I propose to do what we usually do in gnulib, to work around bugs and unusable
> APIs:
>   - Write a configure test for the guarantee, based on [6].
>   - Modify the 'lock' module to use its own implementation of rwlock.
>   - Add a unit test to verify the guarantee (so that we can also detect
>     if the same problem occurs in pth or Solaris), again based on [6].
> 
> Patch in preparation...
> 
> Bruno
> 
> [6] 
> https://github.com/linux-test-project/ltp/blob/master/testcases/open_posix_testsuite/conformance/interfaces/pthread_rwlock_rdlock/2-2.c
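
For reference, the alarm(30) guard suggested in the first quoted paragraph could look roughly like the sketch below. The helper names arm_hang_guard() and disarm_hang_guard() are hypothetical; only alarm() and signal() are standard POSIX calls, and it assumes that being killed by SIGALRM is an acceptable way for the test to fail.

```c
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper: schedule SIGALRM after SECONDS so that a hung test
   is killed instead of blocking forever.  Returns the seconds that were
   left on a previously scheduled alarm, or 0 if there was none.  */
unsigned int arm_hang_guard(unsigned int seconds)
{
  /* The default action for SIGALRM terminates the process.  */
  signal(SIGALRM, SIG_DFL);
  return alarm(seconds);
}

/* Hypothetical helper: cancel the guard once the test has finished.
   Returns the seconds that were still remaining.  */
unsigned int disarm_hang_guard(void)
{
  return alarm(0);
}
```

A test would call arm_hang_guard(30) before starting its threads and disarm_hang_guard() after joining them. This only bounds the run time; it does not address the underlying rwlock behaviour discussed above.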
