Hello Volker,

I've now found the time to look over your debug output below.  My analysis 
leads me to believe that what is shown is "impossible".  That is, the code 
flow as written in the source code cannot possibly do what the dump 
indicates.  The dump shows the subroutine get_next_jcr being called 
recursively with the same argument, and there is no such recursive call in 
the source.  This will almost surely lead to a blocked situation.

How could this happen?  Bad code generated by the compiler, an interrupt 
that restarts the stack at the wrong point, a memory error (which I 
doubt), ...

From what I see there is very little I can do.

I've marked the place in the dump below where it is going wrong -- Thread 3 
stack levels 8 and 9.
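
To pick this pattern out of a long "thread apply all bt" dump, a small 
script can help.  The following is only a minimal sketch and is not part of 
Bacula: the name find_repeated_frames and the regular expressions are 
assumptions based on the frame layout seen in this dump.  It reports 
adjacent frames in one thread that share the same address, function and 
arguments, and it skips unresolved "??" frames.

```python
import re

# Thread header as printed by "thread apply all bt", e.g.
#   Thread 3 (Thread 1094839216 (LWP 29838)):
THREAD_RE = re.compile(r"Thread (\d+) \(Thread")

# Frame line, e.g.
#   #8  0x0808c9a8 in get_next_jcr (prev_jcr=0x80c5b80) at jcr.c:581
# Groups: level, address, function name, rest of the line (arguments, file).
FRAME_RE = re.compile(r"#(\d+)\s+(0x[0-9a-fA-F]+) in (\S+) (.*)")


def find_repeated_frames(dump):
    """Return (thread, level, next_level, function) for every pair of
    adjacent frames with identical address, function and arguments."""
    hits = []
    thread = None
    prev = None  # (level, key) of the previous resolved frame
    for raw in dump.splitlines():
        # Drop email quote markers ("> ") and trailing whitespace.
        line = raw.lstrip("> ").rstrip()
        m = THREAD_RE.match(line)
        if m:
            thread = int(m.group(1))
            prev = None
            continue
        m = FRAME_RE.match(line)
        if not m:
            continue  # wrapped continuation line or unrelated text
        level, addr, func, rest = m.groups()
        if func == "??":
            prev = None  # unresolved frame, nothing meaningful to compare
            continue
        key = (addr, func, rest)
        if prev is not None and prev[1] == key:
            hits.append((thread, prev[0], int(level), func))
        prev = (int(level), key)
    return hits
```

Fed the Thread 3 frames quoted below, the function returns 
[(3, 8, 9, 'get_next_jcr')].  Because it compares the full frame text, 
frames at the same address but with different arguments are not reported.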

On Friday 29 July 2005 23:31, Volker Sauer wrote:
> On Fr, 29 Jul 2005, Kern Sibbald <[EMAIL PROTECTED]> wrote:
> > What I see from this is that everything in the Director is normal.  It
> > thinks that something like 5 jobs are running.  The threads are all
> > waiting on input from one of the other daemons, and there is no mutex
> > dead lock situation. So, if everything is locked up, I suspect the
> > problem is in one of the other daemons.
> >
> > I recommend when it is in this state to do a "status" on all the Clients
> > and on the SD and see if there is anything interesting going on. Perhaps
> > that will tell us the right place to point the debugger.
>
> Again, the director locked. This time it locked up at the first job
> (Client Conc. Jobs = 1) and I was *not* able to connect with bconsole.
> Therefore I couldn't get the status from sd or the clients.
>
> This is what gdb of bacula-dir says:
>
>
> (gdb) run -s -f -c /etc/bacula/bacula-dir.conf
> The program being debugged has been started already.
> Start it from the beginning? (y or n) y
> Starting program: /usr/sbin/bacula-dir -s -f -c
> /etc/bacula/bacula-dir.conf
> [Thread debugging using libthread_db enabled]
> [New Thread 1078020896 (LWP 29834)]
> [New Thread 1086450608 (LWP 29837)]
> [New Thread 1094839216 (LWP 29838)]
> [New Thread 1103227824 (LWP 29857)]
> backup-dir: dird.c:438 Director's configuration file reread.
> [Thread 1103227824 (LWP 29857) exited]
>
> [New Thread 1103227824 (LWP 30275)]
> backup-dir: dird.c:438 Director's configuration file reread.
> [Thread 1103227824 (LWP 30275) exited]
> [New Thread 1103227824 (LWP 30574)]
> [New Thread 1111620528 (LWP 30575)]
> [New Thread 1120074672 (LWP 30577)]
> [New Thread 1128463280 (LWP 30578)]
> [New Thread 1136851888 (LWP 30580)]
> [New Thread 1145240496 (LWP 30581)]
> [New Thread 1153629104 (LWP 30582)]
> [New Thread 1162017712 (LWP 30644)]
>
> Program received signal SIGINT, Interrupt.
> [Switching to Thread 1078020896 (LWP 29834)]
> 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> (gdb) thread apply all bt
>
> Thread 13 (Thread 1162017712 (LWP 30644)):
> #0  0x401a4295 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib/tls/libpthread.so.0
> #1  0x080959fc in rwl_writelock (rwl=0x80c5b80) at rwlock.c:231
> #2  0x0808c8d2 in lock_jcr_chain () at jcr.c:544
> #3  0x0808bd56 in new_jcr (size=1162017184, daemon_free_jcr=0xfffffffc)
> at jcr.c:218
> #4  0x0807458c in new_control_jcr (base_name=0xfffffffc <Address
> 0xfffffffc out of bounds>, job_type=-4)
>     at ua_server.c:90
> #5  0x0807468e in handle_UA_client_request (arg=0x80e9d60) at
> ua_server.c:122
> #6  0x0809e4db in workq_server (arg=0x80c5920) at workq.c:347
> #7  0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #8  0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 12 (Thread 1153629104 (LWP 30582)):
> #0  0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1  0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2  0x080c5b80 in jobs ()
> #3  0x00000000 in ?? ()
> #4  0x00000001 in ?? ()
> #5  0x00000001 in ?? ()
> #6  0x00000000 in ?? ()
> #7  0x44c2fad8 in ?? ()
> #8  0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #9  0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 11 (Thread 1145240496 (LWP 30581)):
> #0  0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1  0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2  0x080c5b80 in jobs ()
> #3  0x00000000 in ?? ()
> #4  0x00000001 in ?? ()
> #5  0x00000001 in ?? ()
> #6  0x00000000 in ?? ()
> #7  0x4442fad8 in ?? ()
> #8  0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #9  0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 10 (Thread 1136851888 (LWP 30580)):
> #0  0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1  0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2  0x080c5b80 in jobs ()
> #3  0x00000000 in ?? ()
> #4  0x00000001 in ?? ()
> #5  0x00000001 in ?? ()
> #6  0x00000000 in ?? ()
> #7  0x43c2fad8 in ?? ()
> #8  0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #9  0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 9 (Thread 1128463280 (LWP 30578)):
> #0  0x401a4295 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib/tls/libpthread.so.0
> #1  0x080959fc in rwl_writelock (rwl=0x80c5b80) at rwlock.c:231
> #2  0x0808c8d2 in lock_jcr_chain () at jcr.c:544
> #3  0x0805bea4 in jobq_server (arg=0x80c57a0) at jobq.c:582
> #4  0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #5  0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 8 (Thread 1120074672 (LWP 30577)):
> #0  0x401a66a1 in __read_nocancel () from /lib/tls/libpthread.so.0
> #1  0x08084d4c in read_nbytes (bsock=0x80e1140, ptr=0x42c2f82c "@",
> nbytes=4) at bnet.c:72
> #2  0x08085067 in bnet_recv (bsock=0x80e1140) at bnet.c:175
> #3  0x08055d88 in bget_dirmsg (bs=0x80e1140) at getmsg.c:79
> #4  0x0805e508 in msg_thread (arg=0x80dcc48) at msgchan.c:235
> #5  0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #6  0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 7 (Thread 1111620528 (LWP 30575)):
> #0  0x401a66a1 in __read_nocancel () from /lib/tls/libpthread.so.0
> #1  0x08084d4c in read_nbytes (bsock=0x80e5f20,
>     ptr=0x4241f08c "9Q\b\bHÌ\r\b _\016\bXòAB\210]\005\b
> [EMAIL PROTECTED]<@[EMAIL PROTECTED]
> ", nbytes=4) at bnet.c:72
> #2  0x08085067 in bnet_recv (bsock=0x80e5f20) at bnet.c:175
> #3  0x08055d88 in bget_dirmsg (bs=0x80e5f20) at getmsg.c:79
> #4  0x0804daf8 in wait_for_job_termination (jcr=0x80dcc48) at
> backup.c:243
> #5  0x0804da23 in do_backup (jcr=0x80dcc48) at backup.c:207
> #6  0x08058946 in job_thread (arg=0x80dcc48) at job.c:215
> #7  0x0805c08a in jobq_server (arg=0x80c57a0) at jobq.c:444
> #8  0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #9  0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 6 (Thread 1103227824 (LWP 30574)):
> #0  0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1  0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2  0x080c5b80 in jobs ()
> #3  0x080c70b8 in ?? ()
> #4  0x00000001 in ?? ()
> #5  0x00000001 in ?? ()
> #6  0x00000000 in ?? ()
> #7  0x41c1ead8 in ?? ()
> #8  0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #9  0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 3 (Thread 1094839216 (LWP 29838)):
> #0  0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1  0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2  0x080c5b80 in jobs ()
> #3  0x00000000 in ?? ()
> #4  0x00000000 in ?? ()
> #5  0x080e8f50 in ?? ()
> #6  0x080e8f60 in ?? ()
> #7  0x4141ea58 in ?? ()
> #8  0x0808c9a8 in get_next_jcr (prev_jcr=0x80c5b80) at jcr.c:581
> #9  0x0808c9a8 in get_next_jcr (prev_jcr=0x80c5b80) at jcr.c:581
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Recursive call -- not in source code.


> #10 0x080590c8 in job_monitor_watchdog (self=0x80c5b80) at job.c:386
> #11 0x0809dad6 in watchdog_thread (arg=0x0) at watchdog.c:257
> #12 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #13 0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 2 (Thread 1086450608 (LWP 29837)):
> #0  0x4036ca27 in select () from /lib/tls/libc.so.6
> #1  0x080877e0 in bnet_thread_server (addrs=0x40c1eb90,
> max_clients=-514, client_wq=0x80c5920,
>     handle_client_request=0xfffffdfe) at bnet_server.c:154
> #2  0x08074569 in connect_thread (arg=0xfffffdfe) at ua_server.c:79
> #3  0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
> #4  0x4037318a in clone () from /lib/tls/libc.so.6
>
> Thread 1 (Thread 1078020896 (LWP 29834)):
> #0  0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1  0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2  0x00000006 in ?? ()
> #3  0x00000069 in ?? ()
> #4  0x00000005 in ?? ()
> #5  0x000000d1 in ?? ()
> #6  0xffffffff in ?? ()
> #7  0x080e8f50 in ?? ()
> #8  0xbffff958 in ?? ()
> #9  0x0805afdb in jobq_add (jq=0x80c57a0, jcr=0x0) at jobq.c:240
> #10 0x0805afdb in jobq_add (jq=0x80c57a0, jcr=0xffffffff) at jobq.c:240
> #11 0x080585fb in run_job (jcr=0x80e8f50) at job.c:140
> #12 0x0804b376 in main (argc=135171920, argv=0x80a0a58) at dird.c:241
>
> I could run bacula-sd and bacula-fd on the client paris (at which
> usually the jobs stop) under the gdb, too (now, that I have the debug
> binaries available).
>
> Regards
> Volker

-- 
Best regards,

Kern

  (">
  /\
  V_V


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users