On Tuesday 23 August 2005 20:26, Martin Simmons wrote:
> >>>>> On Tue, 23 Aug 2005 14:44:45 +0200, Kern Sibbald <[EMAIL PROTECTED]> said:
>
>   Kern> On Tuesday 23 August 2005 13:35, Martin Simmons wrote:
>   >> >>>>> On Tue, 23 Aug 2005 12:30:45 +0200, Kern Sibbald <[EMAIL PROTECTED]> said:
>   >>
>   >>   Kern> Hello Volker,
>   >>
>   >>   Kern> I've now found the time to look over your debug output below. My analysis
>   >>   Kern> leads me to believe that what is show is "impossible". That is the code flow
>   >>   Kern> as created in the source code cannot possibly do what is indicated in the
>   >>   Kern> dump. What is shown in the dump is that the subroutine get_next_jcr_ is
>   >>   Kern> recursively called with the same argument (not possible). This will almost
>   >>   Kern> surely lead to a blocked situation.
>   >>
>   >>   Kern> How could this happen? Bad compiler code, an interrupt that happens and
>   >>   Kern> restarts the stack at the wrong point, memory error (I doubt), ...
>   >>
>   >> I doubt that is really happening -- much more likely is that gdb can't
>   >> understand the stack. Look at the other threads and you'll see that
>   >> jobq_server appears to call jobq_server!
>   >>
>   >> In all these cases, the extra "call" happens where there is a real
>   >> call to something like pthread_mutex_lock. The pthread library is
>   >> probably compiled with too much optimization and/or insufficient debug
>   >> info for gdb to understand the stack inside there.
>
>   Kern> Yes, that is the first thing I thought of, but forgot to put it on the list.
>   Kern> However, if that is the case, I cannot explain the hang.
>
> It looks to me like a deadlock caused by get_next_jcr() locking the mutex
> in the jcr. I see that the latest code just locks the jcr chain instead,
> so hopefully that fixes it.
>
Yes, previously the jcr_chain was locked while traversing the chain, to prevent it from changing at all during the traversal, and while working on an individual jcr, that jcr was locked. The jcr_lock code could be called "recursively" by the same thread as long as it unlocked the same number of times. It wasn't really recursive, but that is how I think about it.

Now (1.37), despite the fact that the name remains the same, the operation is quite different. The jcr_chain lock is a simple mutex (non-recursive -- i.e. it blocks if called two times by the same thread), and the jcr_chain is NOT locked around the traversal of the chain. It is, however, locked every time the chain is changed, so that the chain can be traversed while it is being modified (at least that is the theory).

The locking now uses only one mutex, applied at a very low "micro" level rather than as a global lock held during any traversal. Thus, at the expense of more lock/unlock operations, there is only one lock and it is held for *much* shorter periods, leaving little chance of a race condition, or at least one that is more easily corrected than something involving multiple mutexes (some of which could be recursively called).

--
Best regards,

Kern

  (">
  /\
  V_V

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
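
For readers who want to see the shape of the 1.37 scheme described in the reply above, here is a minimal sketch in C with POSIX threads. It is not Bacula's source: the names node, chain_head, chain_mutex, chain_push, chain_remove, chain_next and second_lock_would_block are all hypothetical. It also assumes that each traversal step takes the lock briefly to read one pointer, which is one possible reading of the "micro" level locking, not something the message states outright.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a job control record; not Bacula's real JCR. */
typedef struct node {
    int job_id;
    struct node *next;
} node;

static node *chain_head = NULL;

/* Default (non-recursive) mutex: the single lock protecting the chain. */
static pthread_mutex_t chain_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Modify the chain: the lock is held only around the pointer updates. */
static void chain_push(node *n)
{
    pthread_mutex_lock(&chain_mutex);
    n->next = chain_head;
    chain_head = n;
    pthread_mutex_unlock(&chain_mutex);
}

/* Unlink n if present; returns 1 if it was found, 0 otherwise.
 * n->next is left intact so a walker currently holding n can still
 * step forward. Node lifetime (e.g. use counts) is not handled here. */
static int chain_remove(node *n)
{
    int found = 0;
    pthread_mutex_lock(&chain_mutex);
    for (node **pp = &chain_head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == n) {
            *pp = n->next;
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(&chain_mutex);
    return found;
}

/* One traversal step. The lock is released before returning, so the
 * caller never holds it while working on the node it was handed. */
static node *chain_next(node *prev)
{
    pthread_mutex_lock(&chain_mutex);
    node *n = (prev == NULL) ? chain_head : prev->next;
    pthread_mutex_unlock(&chain_mutex);
    return n;
}

/* Shows that the mutex is non-recursive: while this thread holds it,
 * a second attempt by the same thread fails with EBUSY. A plain
 * pthread_mutex_lock() here would hang instead of returning. */
static int second_lock_would_block(void)
{
    pthread_mutex_lock(&chain_mutex);
    int rc = pthread_mutex_trylock(&chain_mutex);
    if (rc == 0) {
        pthread_mutex_unlock(&chain_mutex); /* only reached if recursive */
    }
    pthread_mutex_unlock(&chain_mutex);
    return rc == EBUSY;
}
```

A walker would loop with something like `for (node *n = chain_next(NULL); n != NULL; n = chain_next(n))`. Because no lock is held between steps, other threads can push or remove entries during the walk. The trade-off is the one described above: more lock/unlock operations, but a single mutex that is held very briefly.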