Hello,
It appears that TLS is getting stuck indefinitely in a read because of some
networking error.
You might try applying the attached patch. There is a good chance that it
will break the SD out of this condition.
Apply the patch with:
cd <bacula-source>
patch -p2 <3.0.3-tls-stall.patch
./configure <your-options>
make
...
make install
Feedback would be appreciated.
Regards,
Kern
On Thursday 19 November 2009 10:07:00 Christian Gaul wrote:
> Amongst many other clients, I back up my workstation using Bacula (in
> this case 3.0.3, but I've been seeing this since I started using Bacula
> with version 2.2 something).
>
> I can see the job for my client in the director. It is in the status
> "Waiting for client XXX to connect to storage YYY", and it has been in
> that status since I turned the workstation off (around 13 hours ago). I
> am unable to cancel the job, because it is not running or scheduled, and
> none of the other jobs on the director were able to start: they are all
> "waiting for execution", and older jobs have been canceled (thanks for
> fixing the canceled email notification with 3.0.3, btw). This means
> that, on this director, I have not had nightly backups run on any of my
> clients, on any of my SDs, because a single client got turned off in
> between the director initializing the job and the client making the
> connection to the SD.
>
> I've been seeing this behavior, as I said, for a really long time now,
> and it has caused me enough grief to set up a second director / SDs and
> even two FDs per client. A single client (say a broken one, one that is
> turned off, or a malicious one) can bring a whole director to a
> halt. Is there some magic timeout value that is set to a (useless)
> default value that I am missing, or is it rather non-concurrent
> connection creation that is blocking all my other jobs?
>
> I can leave the director in this state for a couple of hours to perform
> magic incantations (stacktrace, backtrace, etc.) if you want any
> information about this issue.
>
> I'll attach the btraceback right away, along with the last log lines, but
> since I am not running this director for testing, it isn't running under
> any debug levels.
>
> After reviewing the bconsole output to make it postable, it seems that
> some jobs did run after 18:03 (the time I turned off my workstation).
> The last job ran (to a different SD than the one that blocked) at 02:30;
> after that, no new jobs, even to different SDs, could start.
>
> I really appreciate the work you guys are doing on Bacula, and I would
> love it if someone would take a look at this.
diff --git a/bacula/src/lib/tls.c b/bacula/src/lib/tls.c
index f5f0623..62a1ecf 100644
--- a/bacula/src/lib/tls.c
+++ b/bacula/src/lib/tls.c
@@ -558,15 +558,22 @@ void tls_bsock_shutdown(BSOCK *bsock)
     */
    int err;
+   btimer_t *tid;
+
    /* Set socket blocking for shutdown */
    bsock->set_blocking();
+   tid = start_bsock_timer(bsock, 60 * 2);
    err = SSL_shutdown(bsock->tls->openssl);
+   stop_bsock_timer(tid);
    if (err == 0) {
       /* Complete shutdown */
+      tid = start_bsock_timer(bsock, 60 * 2);
       err = SSL_shutdown(bsock->tls->openssl);
+      stop_bsock_timer(tid);
    }
+
    switch (SSL_get_error(bsock->tls->openssl, err)) {
    case SSL_ERROR_NONE:
       break;
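
For illustration only, and not part of the patch: the diff brackets each
blocking SSL_shutdown() call with start_bsock_timer() / stop_bsock_timer(),
using a timeout of 60 * 2 (presumably seconds). The same idea, bounding a
call that could otherwise block forever, can be sketched in plain POSIX C.
This is a minimal sketch under stated assumptions: read_with_timeout() and
on_alarm() are hypothetical names, the sketch uses alarm()/SIGALRM in a
single-threaded program, and Bacula's own timer implementation is not shown
in this mail and may well work differently.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handler: it only has to exist so that SIGALRM interrupts
 * the blocking call instead of terminating the process. */
static void on_alarm(int signo)
{
    (void)signo;
}

/* Hypothetical helper (not a Bacula function): read() bounded by a timeout.
 * Returns the byte count, 0 on EOF, or -1 with errno set. An EINTR from
 * read() is reported as ETIMEDOUT; this assumes no other signal interrupts
 * the call, which is good enough for a single-threaded sketch. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, unsigned seconds)
{
    struct sigaction sa, old_sa;
    ssize_t n;
    int saved_errno;

    assert(seconds > 0);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            /* no SA_RESTART, so read() fails with EINTR */
    if (sigaction(SIGALRM, &sa, &old_sa) != 0) {
        return -1;
    }

    alarm(seconds);             /* arm the timer before the blocking call */
    n = read(fd, buf, len);
    saved_errno = errno;
    alarm(0);                   /* disarm it as soon as the call returns */

    sigaction(SIGALRM, &old_sa, NULL);   /* restore the previous handler */
    if (n < 0) {
        errno = (saved_errno == EINTR) ? ETIMEDOUT : saved_errno;
    }
    return n;
}
```

Calling read_with_timeout(fd, buf, sizeof(buf), 120) on a descriptor whose
peer has silently gone away would return -1 with errno set to ETIMEDOUT after
two minutes instead of hanging. The caller can then clean up and release
whatever the stalled job was holding.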
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel