After digging into the sources and running some experiments I have some
corrections to make to my guesses.

The hang appears to be in the storage daemon, not the director.

My best guess at this point for the scenario is:
* A backup job using pool A on my file storage device is started for a
client that is turned off.
* As part of this the storage daemon reserves the device while waiting
for the file daemon on the client.
* The director then tries to contact the client, but gets a DNS lookup
failure which immediately terminates this backup job.
* The next job (a catalog backup) is started which wants to use a
different pool on the same file storage device.
* The storage daemon can't give the file device to this catalog backup
job because it's still waiting for a connection from the client.
* When the 30 minute timeout expires the storage daemon releases the
reservation, but doesn't give the file device to the waiting catalog
backup job.
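
To make the suspected sequence concrete, here is a toy model of it in
Python. It is only a sketch of my guess, not Bacula code: the FileDevice
class, the hand_off_on_release flag, and the job names are all made up
for illustration, and the real storage daemon logic may differ.

```python
# Toy model of the suspected sequence. NOT Bacula code; every name here
# is hypothetical and only illustrates the guess described above.
from collections import deque


class FileDevice:
    """One storage device that a single job can reserve at a time."""

    def __init__(self, hand_off_on_release):
        self.reserved_by = None
        self.waiting = deque()
        # Assumption: whether releasing a reservation wakes the next waiter.
        self.hand_off_on_release = hand_off_on_release

    def reserve(self, job):
        """Reserve the device, or queue the job if it is already reserved."""
        if self.reserved_by is None:
            self.reserved_by = job
            return True
        self.waiting.append(job)
        return False

    def release(self):
        """Drop the current reservation, e.g. when the timeout expires."""
        self.reserved_by = None
        if self.hand_off_on_release and self.waiting:
            self.reserved_by = self.waiting.popleft()


def run_scenario(hand_off_on_release):
    device = FileDevice(hand_off_on_release)
    # The storage daemon reserves the device for the powered-off client.
    device.reserve("client-backup")
    # The director's DNS lookup fails and the client job ends at once,
    # but in this model nothing releases the reservation at that point.
    device.reserve("catalog-backup")  # queues behind the stale reservation
    device.release()                  # the 30 minute timeout expires
    return device.reserved_by, list(device.waiting)


print(run_scenario(hand_off_on_release=False))  # models the hang
print(run_scenario(hand_off_on_release=True))   # models the expected behaviour
```

With the flag off, the device ends up free while the catalog backup is
still queued, which matches what I see. With the flag on, the catalog
backup gets the device as soon as the stale reservation is dropped.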

I suspect that the hang doesn't happen when the DNS lookup succeeds but
the client doesn't respond, because the delay while the director retries
the connection to the client lets the storage daemon's timeout expire
before the next job is started.
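
For anyone who wants to check which of the two failure modes a given
client name produces, a small probe along these lines should do it. This
is a sketch using Python's standard socket module, not anything from
Bacula; classify_client and its parameters are names I made up, and the
default port 9102 is the file daemon port from the log message in my
original post.

```python
import socket


def classify_client(host, port=9102, resolve=socket.gethostbyname,
                    connect=socket.create_connection, timeout=5.0):
    """Report whether a client fails at DNS lookup or at connect time.

    resolve and connect are parameters only so the function can be
    exercised without a network; by default the real socket calls are used.
    """
    try:
        address = resolve(host)
    except (socket.gaierror, socket.herror):
        # Name does not resolve: the case that seems to trigger the hang.
        return "dns-failure"
    try:
        conn = connect((address, port), timeout)
    except OSError:
        # Name resolves but nothing answers: the job just ends in "Error".
        return "connect-failure"
    conn.close()
    return "reachable"
```

Calling classify_client("some-client") with a real client name returns
one of the three strings. If my guess is right, only the "dns-failure"
case should be followed by a hung catalog backup.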

I have been able to reproduce this at will by setting up a client with a
bogus network name and then from the console start a backup of that
client followed by a catalog backup.

I am continuing to try both to pinpoint the exact cause of this hang in
the sources and to come up with a workaround (a different device for my
catalog backups?), but would appreciate any hints that might accelerate
either of those efforts.

Thanks,
Stuart Griffith
Rodgers Instruments, LLC

> Hi,
> 
> I've got a Bacula setup that has hung on me a couple of times
> recently. I am ready and willing to pursue the problem, but would like
> to know if someone else is already working on it.
> 
> I did check the bug reports and scanned the email archives, but did
> not find anything that looked like this problem.
> 
> Configuration:
> Bacula 1.38.1
> Director and Storage running on Scientific Linux 4.1
> Clients running on Scientific Linux 4.1, RedHat 9, Mandrake 10.2, and
> Windows XP
> Some clients have fixed IP addresses, some use DHCP
> 
> Backup schedule:
> Each day run full, differential, or incremental on all clients, then
> backup catalog. All backups going to disk files.
> 
> Problem:
> Occasionally the catalog backup does not run. I get an "intervention
> needed" email message saying it is "waiting to reserve a device." When
> this happens all job processing is halted. The only way I've found to
> get jobs running again is to cancel all the pending jobs and restart
> the Director.
> 
> A pattern I've noticed in these occurrences is that before the hang
> the backup job for one of the clients using DHCP failed because the
> machine had been turned off, but the indication Bacula got that the
> machine was off was "ERROR bnet.c:767 gethostbyname() for host ...
> failed: ERR=Autoritative answer for host not found." (On other
> occasions when one of these machines was turned off the message is
> "WARNING bnet.c:852 could not connect to file daemon on ...:9102
> ERR=No route to host." When that happens that particular job is given
> an "Error" completion status and all remaining jobs run as scheduled.)
> 
> My hypothesis is that the difference in responses when trying to
> connect to one of these machines is due to whether or not the DHCP
> lease has expired. If it has not, the DNS record is still there so
> gethostbyname succeeds but the attempt to connect to the file daemon
> fails. If the lease has expired, the DNS record is no longer there and
> the gethostbyname fails.
> 
> A second part to my hypothesis is that somewhere in the handling of
> the gethostbyname failure some resource needed for the catalog backup
> is not being released.
> 
> Is this a problem that anyone else has encountered?
> Is someone already working on it?
> Any corrections to my guesses or other hints before I go digging into
> the sources?
> 
> Thanks,
> Stuart Griffith
> Rodgers Instruments, LLC
> 


