Hi, I've got a Bacula setup that has hung on me a couple of times recently. I am ready and willing to pursue the problem, but would like to know if someone else is already working on it.
I did check the bug reports and scanned the email archives, but did not find anything that looked like this problem.

Configuration:
  Bacula 1.38.1
  Director and Storage running on Scientific Linux 4.1
  Clients running on Scientific Linux 4.1, RedHat 9, Mandrake 10.2, and Windows XP
  Some clients have fixed IP addresses, some use DHCP

Backup schedule:
  Each day run a full, differential, or incremental backup on all clients, then back up the catalog. All backups go to disk files.

Problem:
  Occasionally the catalog backup does not run. I get an "intervention needed" email message saying it is "waiting to reserve a device." When this happens all job processing is halted. The only way I've found to get jobs running again is to cancel all the pending jobs and restart the Director.

A pattern I've noticed in these occurrences is that, before the hang, the backup job for one of the DHCP clients failed because the machine had been turned off, and the indication Bacula got that the machine was off was

  "ERROR bnet.c:767 gethostbyname() for host ... failed: ERR=Autoritative answer for host not found."

On other occasions when one of these machines was turned off, the message was

  "WARNING bnet.c:852 could not connect to file daemon on ...:9102 ERR=No route to host."

When that happens, that particular job is given an "Error" completion status and all remaining jobs run as scheduled.

My hypothesis is that the difference in responses when trying to connect to one of these machines is due to whether or not the DHCP lease has expired. If it has not, the DNS record is still there, so gethostbyname succeeds but the attempt to connect to the file daemon fails. If the lease has expired, the DNS record is no longer there and gethostbyname fails. The second part of my hypothesis is that somewhere in the handling of the gethostbyname failure, some resource needed for the catalog backup is not being released.

Is this a problem that anyone else has encountered? Is someone already working on it?
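One way to check the first half of the hypothesis outside Bacula would be to probe a client in two separate steps: name lookup first, then a TCP connect to the file daemon port. What follows is only a minimal sketch in Python, not Bacula's code. The `probe` function, its `resolve` and `connect` parameters, and the result strings are hypothetical names chosen for illustration; only port 9102 comes from the messages above.

```python
import socket


def probe(host, port=9102, timeout=5.0,
          resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Classify why a client is unreachable (hypothetical helper, not Bacula code).

    Returns:
      "dns-failure"     the name does not resolve (under the hypothesis:
                        lease expired, DNS record gone)
      "connect-failure" the name resolves but the TCP connect fails
                        (under the hypothesis: machine off, record still there)
      "ok"              the connection succeeded
    """
    # Step 1: name lookup only, so a DNS failure is reported on its own.
    try:
        resolve(host, port)
    except socket.gaierror:
        return "dns-failure"

    # Step 2: TCP connect to the file daemon port.
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        return "connect-failure"

    conn.close()
    return "ok"
```

Running something like this against a DHCP client while it is switched off, once before and once after its lease expires, should show whether the two Bacula messages line up with the two failure classes. It would not say anything about the second half of the hypothesis, the unreleased resource.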
Any corrections to my guesses or other hints before I go digging into the sources?

Thanks,
Stuart Griffith
Rodgers Instruments, LLC

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users