Hi, I'm using Bacula 3.0.3 and the director's job queue got stuck after running the first job. The remaining jobs waited indefinitely for execution. Restarting the director let exactly one more job run before the queue got stuck again, and so on.
Googling around, I found these 2 posts, neither with a satisfying answer:

http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/bacula-25/upgrade-to-3-0-3-job-is-waiting-for-execution-102156/
http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/bacula-25/job-is-waiting-for-execuition-101508/

I then looked at the code and found that a deadlock happens in message handling. The problem is located in the close_msg(JCR *) function in message.c. When it encounters an error while sending an e-mail, it calls the macro Jmsg1 (line 485) to report it. This macro calls dispatch_message, which tries to acquire fides_mutex (line 738). Unfortunately, this mutex was already acquired in close_msg (line 431), so the thread blocks on a mutex it already holds and deadlocks (as stated in the mutex documentation for the PTHREAD_MUTEX_INITIALIZER kind).

This problem was affecting me because the mail daemon was not properly configured on my server.

It could be worth reviewing these parts of the code to avoid this situation. In the meantime, I wrote a quick patch for lockmgr.c which simply upgrades mutexes to the PTHREAD_MUTEX_ERRORCHECK_NP kind and resolves this error.

Hope this helps someone,
Renaud

patch:

diff -rupN bacula-3.0.3.vanilla/src/lib/lockmgr.c bacula-3.0.3.patched/src/lib/lockmgr.c
--- bacula-3.0.3.vanilla/src/lib/lockmgr.c 2009-10-18 11:10:16.000000000 +0200
+++ bacula-3.0.3.patched/src/lib/lockmgr.c 2009-12-31 18:05:59.000000000 +0100
@@ -616,6 +616,15 @@ void lmgr_cleanup_main()
  */
 int lmgr_mutex_lock(pthread_mutex_t *m, const char *file, int line)
 {
+   /* Patch to avoid deadlock if mutex is locked more than once */
+   /* There's some performance hit which makes it probably not acceptable */
+   /* for large system usage. */
+   if(*m == PTHREAD_MUTEX_INITIALIZER) {
+      pthread_mutexattr_t attr;
+      pthread_mutexattr_settype( &attr, PTHREAD_MUTEX_ERRORCHECK_NP );
+      pthread_mutex_init( m, &attr );
+   }
+
    int ret;
    lmgr_thread_t *self = lmgr_get_thread_info();
    self->pre_P(m, file, line);
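
For anyone who wants to see the mutex behaviour in isolation, here is a minimal standalone sketch, not taken from Bacula. The helper name relock_result is hypothetical, and it uses the portable POSIX constant PTHREAD_MUTEX_ERRORCHECK rather than the _NP spelling in the patch. It locks a mutex and then locks it again from the same thread, which is what the close_msg / dispatch_message path described above does with fides_mutex:

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

/*
 * Hypothetical helper, not part of Bacula: create a mutex of the given
 * type, lock it, then lock it again from the same thread.
 * Returns the result of the second pthread_mutex_lock() call, or -1 if
 * setup fails. Do not pass PTHREAD_MUTEX_NORMAL: the second lock would
 * block forever, which is the deadlock described in this post.
 */
int relock_result(int type)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int second;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_settype(&attr, type) != 0 ||
        pthread_mutex_init(&m, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    if (pthread_mutex_lock(&m) != 0) {
        pthread_mutex_destroy(&m);
        return -1;
    }

    /* Same thread, already the owner of m. */
    second = pthread_mutex_lock(&m);

    /* A recursive mutex was locked twice and needs one extra unlock. */
    if (second == 0)
        pthread_mutex_unlock(&m);
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return second;
}
```

With an error-checking mutex the second lock returns EDEADLK instead of blocking, and with PTHREAD_MUTEX_RECURSIVE it succeeds. Note that the error-checking kind only turns the hang into an error return. The caller still proceeds without having taken the lock a second time, so this is a workaround rather than a fix for the double acquisition itself.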