Wait a bit before running the debugger on the other daemons. This backtrace is very different, and seems to show a mutex lockup. I need to look at it in detail ...
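For readers following the thread: in the quoted backtrace, threads 9 and 13 are parked in `pthread_cond_wait` inside `rwl_writelock` (rwlock.c:231), called from `lock_jcr_chain`, while most other threads sit in `__lll_mutex_lock_wait`. The C sketch that follows is a hypothetical, simplified reader/writer lock built from one mutex and one condition variable. The `demo_rwl_*` names and the `demo_rwlock` type are invented for illustration and are not Bacula's code. It shows why a writer waiting on a condition variable never returns if some holder never releases the lock. It also swaps in `pthread_cond_timedwait` so that the hang is reported instead of blocking forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical reader/writer lock: one mutex guards the counters and one
 * condition variable is broadcast whenever the lock is released. This is
 * loosely modeled on the pattern the backtrace suggests, NOT on Bacula's
 * actual rwlock.c. */
typedef struct {
    pthread_mutex_t mutex;  /* protects readers and writer */
    pthread_cond_t  cond;   /* signalled when the lock becomes free */
    int  readers;           /* number of active readers */
    bool writer;            /* true while a writer holds the lock */
} demo_rwlock;

void demo_rwl_init(demo_rwlock *l) {
    pthread_mutex_init(&l->mutex, NULL);
    pthread_cond_init(&l->cond, NULL);
    l->readers = 0;
    l->writer = false;
}

void demo_rwl_readlock(demo_rwlock *l) {
    pthread_mutex_lock(&l->mutex);
    while (l->writer)                      /* wait out an active writer */
        pthread_cond_wait(&l->cond, &l->mutex);
    l->readers++;
    pthread_mutex_unlock(&l->mutex);
}

void demo_rwl_readunlock(demo_rwlock *l) {
    pthread_mutex_lock(&l->mutex);
    assert(l->readers > 0);
    if (--l->readers == 0)                 /* last reader wakes writers */
        pthread_cond_broadcast(&l->cond);
    pthread_mutex_unlock(&l->mutex);
}

/* Write lock that gives up after timeout_sec seconds. An untimed version
 * would call pthread_cond_wait here and block forever if a holder never
 * releases the lock. Returns 0 on success, or the nonzero error from
 * pthread_cond_timedwait (ETIMEDOUT in the expected case). */
int demo_rwl_writelock_timed(demo_rwlock *l, int timeout_sec) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);  /* default cond clock */
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&l->mutex);
    int rc = 0;
    while ((l->writer || l->readers > 0) && rc == 0)
        rc = pthread_cond_timedwait(&l->cond, &l->mutex, &deadline);
    if (l->writer || l->readers > 0) {
        /* Still held when we stopped waiting: report the would-be hang. */
        pthread_mutex_unlock(&l->mutex);
        return rc;
    }
    l->writer = true;
    pthread_mutex_unlock(&l->mutex);
    return 0;
}

void demo_rwl_writeunlock(demo_rwlock *l) {
    pthread_mutex_lock(&l->mutex);
    l->writer = false;
    pthread_cond_broadcast(&l->cond);      /* wake readers and writers */
    pthread_mutex_unlock(&l->mutex);
}
```

Called from a single thread, taking a read lock and never releasing it makes `demo_rwl_writelock_timed(&l, 1)` return `ETIMEDOUT` after about a second; with a plain `pthread_cond_wait` the call would simply never return, which is the shape of threads 9 and 13 in the quoted backtrace.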
On Friday 29 July 2005 23:31, Volker Sauer wrote:
> On Fr, 29 Jul 2005, Kern Sibbald <[EMAIL PROTECTED]> wrote:
> > What I see from this is that everything in the Director is normal. It
> > thinks that something like 5 jobs are running. The threads are all
> > waiting on input from one of the other daemons, and there is no mutex
> > dead lock situation. So, if everything is locked up, I suspect the
> > problem is in one of the other daemons.
> >
> > I recommend when it is in this state to do a "status" on all the Clients
> > and on the SD and see if there is anything interesting going on. Perhaps
> > that will tell us the right place to point the debugger.
>
> Again, the director locked. This time it locked up at the first job
> (Client Conc. Jobs = 1) and I was *not* able to connect with bconsole.
> Therefore I couldn't get the status from sd or the clients.
>
> This is what gdb of bacula-dir says:
>
>
> (gdb) run -s -f -c /etc/bacula/bacula-dir.conf
> The program being debugged has been started already.
> Start it from the beginning? (y or n) y
> Starting program: /usr/sbin/bacula-dir -s -f -c /etc/bacula/bacula-dir.conf
> [Thread debugging using libthread_db enabled]
> [New Thread 1078020896 (LWP 29834)]
> [New Thread 1086450608 (LWP 29837)]
> [New Thread 1094839216 (LWP 29838)]
> [New Thread 1103227824 (LWP 29857)]
> backup-dir: dird.c:438 Director's configuration file reread.
> [Thread 1103227824 (LWP 29857) exited]
>
> [New Thread 1103227824 (LWP 30275)]
> backup-dir: dird.c:438 Director's configuration file reread.
> [Thread 1103227824 (LWP 30275) exited]
> [New Thread 1103227824 (LWP 30574)]
> [New Thread 1111620528 (LWP 30575)]
> [New Thread 1120074672 (LWP 30577)]
> [New Thread 1128463280 (LWP 30578)]
> [New Thread 1136851888 (LWP 30580)]
> [New Thread 1145240496 (LWP 30581)]
> [New Thread 1153629104 (LWP 30582)]
> [New Thread 1162017712 (LWP 30644)]
>
> Program received signal SIGINT, Interrupt.
> [Switching to Thread 1078020896 (LWP 29834)]
> 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> (gdb) thread apply all bt
>
> Thread 13 (Thread 1162017712 (LWP 30644)):
> #0 0x401a4295 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
> #1 0x080959fc in rwl_writelock (rwl=0x80c5b80) at rwlock.c:231
> #2 0x0808c8d2 in lock_jcr_chain () at jcr.c:544
> #3 0x0808bd56 in new_jcr (size=1162017184, daemon_free_jcr=0xfffffffc) at jcr.c:218
> #4 0x0807458c in new_control_jcr (base_name=0xfffffffc <Address 0xfffffffc out of bounds>, job_type=-4) at ua_server.c:90
> #5 0x0807468e in handle_UA_client_request (arg=0x80e9d60) at ua_server.c:122
> #6 0x0809e4db in workq_server (arg=0x80c5920) at workq.c:347
> #7 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #8 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 12 (Thread 1153629104 (LWP 30582)):
> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2 0x080c5b80 in jobs ()
> #3 0x00000000 in ?? ()
> #4 0x00000001 in ?? ()
> #5 0x00000001 in ?? ()
> #6 0x00000000 in ?? ()
> #7 0x44c2fad8 in ?? ()
> #8 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #9 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 11 (Thread 1145240496 (LWP 30581)):
> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2 0x080c5b80 in jobs ()
> #3 0x00000000 in ?? ()
> #4 0x00000001 in ?? ()
> #5 0x00000001 in ?? ()
> #6 0x00000000 in ?? ()
> #7 0x4442fad8 in ?? ()
> #8 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #9 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 10 (Thread 1136851888 (LWP 30580)):
> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2 0x080c5b80 in jobs ()
> #3 0x00000000 in ?? ()
> #4 0x00000001 in ?? ()
> #5 0x00000001 in ?? ()
> #6 0x00000000 in ?? ()
> #7 0x43c2fad8 in ?? ()
> #8 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #9 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 9 (Thread 1128463280 (LWP 30578)):
> #0 0x401a4295 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
> #1 0x080959fc in rwl_writelock (rwl=0x80c5b80) at rwlock.c:231
> #2 0x0808c8d2 in lock_jcr_chain () at jcr.c:544
> #3 0x0805bea4 in jobq_server (arg=0x80c57a0) at jobq.c:582
> #4 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #5 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 8 (Thread 1120074672 (LWP 30577)):
> #0 0x401a66a1 in __read_nocancel () from /lib/tls/libpthread.so.0
> #1 0x08084d4c in read_nbytes (bsock=0x80e1140, ptr=0x42c2f82c "@", nbytes=4) at bnet.c:72
> #2 0x08085067 in bnet_recv (bsock=0x80e1140) at bnet.c:175
> #3 0x08055d88 in bget_dirmsg (bs=0x80e1140) at getmsg.c:79
> #4 0x0805e508 in msg_thread (arg=0x80dcc48) at msgchan.c:235
> #5 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #6 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 7 (Thread 1111620528 (LWP 30575)):
> #0 0x401a66a1 in __read_nocancel () from /lib/tls/libpthread.so.0
> #1 0x08084d4c in read_nbytes (bsock=0x80e5f20,
> ptr=0x4241f08c "9Q\b\bHÌ\r\b _\016\bXòAB\210]\005\b
> [EMAIL PROTECTED]<@[EMAIL PROTECTED]
> ", nbytes=4) at bnet.c:72
> #2 0x08085067 in bnet_recv (bsock=0x80e5f20) at bnet.c:175
> #3 0x08055d88 in bget_dirmsg (bs=0x80e5f20) at getmsg.c:79
> #4 0x0804daf8 in wait_for_job_termination (jcr=0x80dcc48) at backup.c:243
> #5 0x0804da23 in do_backup (jcr=0x80dcc48) at backup.c:207
> #6 0x08058946 in job_thread (arg=0x80dcc48) at job.c:215
> #7 0x0805c08a in jobq_server (arg=0x80c57a0) at jobq.c:444
> #8 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #9 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 6 (Thread 1103227824 (LWP 30574)):
> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2 0x080c5b80 in jobs ()
> #3 0x080c70b8 in ?? ()
> #4 0x00000001 in ?? ()
> #5 0x00000001 in ?? ()
> #6 0x00000000 in ?? ()
> #7 0x41c1ead8 in ?? ()
> #8 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #9 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 3 (Thread 1094839216 (LWP 29838)):
> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2 0x080c5b80 in jobs ()
> #3 0x00000000 in ?? ()
> #4 0x00000000 in ?? ()
> #5 0x080e8f50 in ?? ()
> #6 0x080e8f60 in ?? ()
> #7 0x4141ea58 in ?? ()
> #8 0x0808c9a8 in get_next_jcr (prev_jcr=0x80c5b80) at jcr.c:581
> #9 0x0808c9a8 in get_next_jcr (prev_jcr=0x80c5b80) at jcr.c:581
> #10 0x080590c8 in job_monitor_watchdog (self=0x80c5b80) at job.c:386
> #11 0x0809dad6 in watchdog_thread (arg=0x0) at watchdog.c:257
> #12 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #13 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 2 (Thread 1086450608 (LWP 29837)):
> #0 0x4036ca27 in select () from /lib/tls/libc.so.6
> #1 0x080877e0 in bnet_thread_server (addrs=0x40c1eb90, max_clients=-514, client_wq=0x80c5920, handle_client_request=0xfffffdfe) at bnet_server.c:154
> #2 0x08074569 in connect_thread (arg=0xfffffdfe) at ua_server.c:79
> #3 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #4 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 1 (Thread 1078020896 (LWP 29834)):
> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2 0x00000006 in ?? ()
> #3 0x00000069 in ?? ()
> #4 0x00000005 in ?? ()
> #5 0x000000d1 in ?? ()
> #6 0xffffffff in ?? ()
> #7 0x080e8f50 in ?? ()
> #8 0xbffff958 in ?? ()
> #9 0x0805afdb in jobq_add (jq=0x80c57a0, jcr=0x0) at jobq.c:240
> #10 0x0805afdb in jobq_add (jq=0x80c57a0, jcr=0xffffffff) at jobq.c:240
> #11 0x080585fb in run_job (jcr=0x80e8f50) at job.c:140
> #12 0x0804b376 in main (argc=135171920, argv=0x80a0a58) at dird.c:241
>
> I could run bacula-sd and bacula-fd on the client paris (at which
> usually the jobs stop) under the gdb, too (now, that I have the debug
> binaries available).
>
> Regards
> Volker

--
Best regards,

Kern

  (">
  /\
  V_V
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users