Re: [Bacula-users] Re: [Bacula-devel] Severe problem: director hangs in production system

Kern Sibbald Tue, 23 Aug 2005 12:09:48 -0700

On Tuesday 23 August 2005 20:26, Martin Simmons wrote:
> >>>>> On Tue, 23 Aug 2005 14:44:45 +0200, Kern Sibbald <[EMAIL PROTECTED]>
> >>>>> said:
>
>   Kern> On Tuesday 23 August 2005 13:35, Martin Simmons wrote:
>   >> >>>>> On Tue, 23 Aug 2005 12:30:45 +0200, Kern Sibbald
>   >> >>>>> <[EMAIL PROTECTED]> said:
>
>   Kern> Hello Volker,
>
>   Kern> I've now found the time to look over your debug output below.  My
>   Kern> analysis leads me to believe that what is shown is "impossible".
>   Kern> That is, the code flow as created in the source code cannot
>   Kern> possibly do what is indicated in the dump.  What is shown in the
>   Kern> dump is that the subroutine get_next_jcr_ is recursively called
>   Kern> with the same argument (not possible).  This will almost surely
>   Kern> lead to a blocked situation.
>
>   Kern> How could this happen?  Bad compiler code, an interrupt that
>   Kern> happens and restarts the stack at the wrong point, memory error
>   Kern> (I doubt), ...
>   >>
>   >> I doubt that is really happening -- much more likely is that gdb can't
>   >> understand the stack.  Look at the other threads and you'll see that
>   >> jobq_server appears to call jobq_server!
>   >>
>   >> In all these cases, the extra "call" happens where there is a real
>   >> call to something like pthread_mutex_lock.  The pthread library is
>   >> probably compiled with too much optimization and/or insufficient debug
>   >> info for gdb to understand the stack inside there.
>
>   Kern> Yes, that is the first thing I thought of, but forgot to put it on
>   Kern> the list.  However, if that is the case, I cannot explain the hang.
>
> It looks to me like a deadlock caused by get_next_jcr() locking the mutex
> in the jcr.  I see that the latest code just locks the jcr chain instead,
> so hopefully that fixes it.
>


Yes.  Previously, the jcr_chain was locked while traversing the chain, to
prevent it from changing at all during the traversal, and each individual jcr
was locked while it was being worked on.  The jcr_lock code could be called
"recursively" by the same thread, as long as that thread unlocked it the same
number of times.  It wasn't really recursive, but that is how I think about
it.
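
As a rough illustration of that old per-jcr behaviour (this is not Bacula's
actual code; the class and method names are made up, and Python is used only
for brevity), a lock that the owning thread may take repeatedly, as long as
it unlocks the same number of times, could look something like this:

```python
import threading


class CountingLock:
    """Hypothetical lock that its owner thread may take more than once."""

    def __init__(self):
        self._mutex = threading.Lock()  # the real, non-recursive mutex
        self._owner = None              # ident of the thread holding it
        self._count = 0                 # how many times the owner has locked

    def lock(self):
        me = threading.get_ident()
        if self._owner == me:
            # Same thread again: only bump the counter, do not block.
            self._count += 1
            return
        self._mutex.acquire()           # any other thread blocks here
        self._owner = me
        self._count = 1

    def unlock(self):
        if self._owner != threading.get_ident():
            raise RuntimeError("unlock called by a thread that is not the owner")
        self._count -= 1
        if self._count == 0:
            # Only the last matching unlock releases the real mutex.
            self._owner = None
            self._mutex.release()

    def held(self):
        return self._mutex.locked()
```

If the owner forgets one unlock, the underlying mutex stays held and every
other thread that calls lock() blocks forever, which is the kind of hang that
is hard to track down once several such locks are involved.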

Now (1.37), although the name remains the same, the operation is quite
different.  The jcr_chain lock is a simple mutex (non-recursive, i.e. it
blocks if locked twice by the same thread), and the jcr_chain is NOT locked
around the traversal of the chain.  It is, however, locked every time the
chain is changed, so that the chain can be traversed while it is being
modified (at least that is the theory).  The locking uses only one mutex, at a
very low "micro" level, rather than a global lock held during any traversal.
Thus, at the expense of more lock/unlock calls, there is only one lock and it
is held for *much* shorter periods, leaving little chance of a race
condition, or at least leaving one that is more easily corrected than
something involving multiple mutexes (some of which could be recursively
locked).
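
To make the idea concrete, here is a minimal sketch of the "one short-held
lock" scheme.  It is an assumption-laden illustration, not the 1.37 source:
JobChain, add, remove and get_next are hypothetical names, the chain is a
plain Python list, and a walker whose current entry has been removed simply
stops, which the real code may well handle differently.

```python
import threading


class JobChain:
    """Hypothetical job chain guarded by a single non-recursive lock."""

    def __init__(self):
        self._lock = threading.Lock()   # simple mutex, never held for long
        self._jobs = []

    def add(self, job):
        with self._lock:                # locked only while the chain changes
            self._jobs.append(job)

    def remove(self, job):
        with self._lock:
            self._jobs.remove(job)

    def get_next(self, prev=None):
        """Return the entry after prev, or the first entry if prev is None."""
        with self._lock:                # held for one step, not the whole walk
            if prev is None:
                return self._jobs[0] if self._jobs else None
            try:
                i = self._jobs.index(prev)
            except ValueError:
                return None             # prev was removed in the meantime
            return self._jobs[i + 1] if i + 1 < len(self._jobs) else None


# Walk the chain and change it from inside the loop.  No lock is held
# between steps, so add() does not block against the walker's own thread.
chain = JobChain()
for name in ("job-a", "job-b", "job-c"):
    chain.add(name)

visited = []
job = chain.get_next()
while job is not None:
    visited.append(job)
    if job == "job-a":
        chain.add("job-d")
    job = chain.get_next(job)
```

With a global lock held for the whole traversal, the add() inside the loop
would block on a non-recursive mutex; here each lock/unlock pair covers only
a single step.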

-- 
Best regards,

Kern

  (">
  /\
  V_V


