Hi,

I've got a Bacula setup that has hung on me a couple of times recently.
I am ready and willing to pursue the problem, but would like to know if
someone else is already working on it.

I did check the bug reports and scanned the email archives, but did not
find anything that looked like this problem.

Configuration:
Bacula 1.38.1
Director and Storage running on Scientific Linux 4.1
Clients running on Scientific Linux 4.1, RedHat 9, Mandrake 10.2, and
Windows XP
Some clients have fixed IP addresses, some use DHCP

Backup schedule:
Each day, run a full, differential, or incremental backup on all
clients, then back up the catalog. All backups go to disk files.

Problem:
Occasionally the catalog backup does not run. I get an "intervention
needed" email message saying it is "waiting to reserve a device." When
this happens all job processing is halted. The only way I've found to
get jobs running again is to cancel all the pending jobs and restart the
Director.

A pattern I've noticed in these occurrences is that, before the hang,
the backup job for one of the clients using DHCP failed because the
machine had been turned off, and the indication Bacula got that the
machine was off was "ERROR bnet.c:767 gethostbyname() for host ...
failed: ERR=Autoritative answer for host not found." (On other occasions
when one of these machines was turned off, the message was "WARNING
bnet.c:852 could not connect to file daemon on ...:9102 ERR=No route to
host." When that happens, that particular job is given an "Error"
completion status and all remaining jobs run as scheduled.)

My hypothesis is that the difference in responses when trying to connect
to one of these machines is due to whether or not the DHCP lease has
expired. If it has not, the DNS record is still there so gethostbyname
succeeds but the attempt to connect to the file daemon fails. If the
lease has expired, the DNS record is no longer there and the
gethostbyname fails.
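
One way to check which of the two failure modes a given client is in is
a small probe run from the Director host. The following is only a
minimal sketch, not Bacula code: the function name classify_client and
its return strings are made up for illustration, and 9102 is the file
daemon port taken from the error message above.

```python
import socket

def classify_client(host, port=9102, timeout=5.0,
                    resolve=socket.getaddrinfo,
                    connect=socket.create_connection):
    """Classify a client as 'dns_failure', 'connect_failure' or 'reachable'.

    resolve and connect are injectable so the logic can be exercised
    without a network; by default they are the standard socket calls.
    """
    # Step 1: name lookup. This corresponds to the case where the DNS
    # record is gone (hypothesis: the DHCP lease has expired).
    try:
        resolve(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: TCP connect. The name resolved, but the machine may be off
    # (hypothesis: lease still valid, DNS record still present).
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        return "connect_failure"

    conn.close()
    return "reachable"
```

Calling classify_client("some-dhcp-client") against a powered-off
machine before and after its lease expires should show whether the
result really flips from "connect_failure" to "dns_failure".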

A second part to my hypothesis is that somewhere in the handling of the
gethostbyname failure some resource needed for the catalog backup is not
being released.
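
The kind of bug this hypothesis describes can be sketched as follows.
This is purely an illustration of the pattern and is not taken from the
Bacula sources: Device, run_job_leaky and run_job_safe are hypothetical
names, and a lookup failure is modelled as a LookupError.

```python
import threading

class Device:
    """Hypothetical stand-in for a storage device that one job may hold."""
    def __init__(self):
        self._lock = threading.Lock()

    def reserve(self):
        # Non-blocking: returns False if another job still holds the device.
        return self._lock.acquire(blocking=False)

    def release(self):
        self._lock.release()

def run_job_leaky(device, resolve):
    if not device.reserve():
        return "waiting to reserve a device"
    try:
        resolve()
    except LookupError:
        # Hypothesized bug: early return on the lookup-failure path
        # skips the release below, so the device stays reserved.
        return "error"
    device.release()
    return "ok"

def run_job_safe(device, resolve):
    if not device.reserve():
        return "waiting to reserve a device"
    try:
        resolve()
        return "ok"
    except LookupError:
        return "error"
    finally:
        # Released on every path, including the failure path.
        device.release()
```

With the leaky version, one job whose lookup fails leaves every later
job on the same device stuck waiting, which would match the symptom of
the catalog backup never getting a device.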

Is this a problem that anyone else has encountered?
Is someone already working on it?
Any corrections to my guesses or other hints before I go digging into
the sources?

Thanks,
Stuart Griffith
Rodgers Instruments, LLC




_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
