After digging into the sources and running some experiments I have some corrections to make to my guesses.
The hang appears to be in the storage daemon, not the director. My best guess at this point for the scenario is:

* A backup job is started for a client that is turned off, using pool A on my file storage device.
* As part of this, the storage daemon reserves the device while waiting for the file daemon on the client.
* The director then tries to contact the client, but gets a DNS lookup failure, which immediately terminates this backup job.
* The next job (a catalog backup) is started, and it wants to use a different pool on the same file storage device.
* The storage daemon can't give the file device to this catalog backup job because it is still waiting for a connection from the client.
* When the 30-minute timeout expires, the storage daemon releases the reservation, but doesn't give the file device to the waiting catalog backup job.

I suspect that the hang doesn't happen when the DNS lookup succeeds but the client doesn't respond, because the delay while the director retries the connection to the client lets the storage daemon's timeout expire before the next job is started.

I have been able to reproduce this at will by setting up a client with a bogus network name and then, from the console, starting a backup of that client followed by a catalog backup.

I am continuing to try both to pinpoint the exact cause of this hang in the sources and to come up with a workaround (a different device for my catalog backups?), but would appreciate any hints that might accelerate either of those efforts.

Thanks,
Stuart Griffith
Rodgers Instruments, LLC

> Hi,
>
> I've got a Bacula setup that has hung on me a couple of times recently.
> I am ready and willing to pursue the problem, but would like to know if
> someone else is already working on it.
>
> I did check the bug reports and scanned the email archives, but did not
> find anything that looked like this problem.
>
> Configuration:
> Bacula 1.38.1
> Director and Storage running on Scientific Linux 4.1
> Clients running on Scientific Linux 4.1, RedHat 9, Mandrake 10.2, and
> Windows XP
> Some clients have fixed IP addresses, some use DHCP
>
> Backup schedule:
> Each day run full, differential, or incremental on all clients, then
> backup catalog. All backups going to disk files.
>
> Problem:
> Occasionally the catalog backup does not run. I get an "intervention
> needed" email message saying it is "waiting to reserve a device." When
> this happens all job processing is halted. The only way I've found to
> get jobs running again is to cancel all the pending jobs and restart
> the Director.
>
> A pattern I've noticed in these occurrences is that before the hang
> the backup job for one of the clients using DHCP failed because the
> machine had been turned off, but the indication Bacula got that the
> machine was off was "ERROR bnet.c:767 gethostbyname() for host ...
> failed: ERR=Autoritative answer for host not found." (On other
> occasions when one of these machines was turned off, the message was
> "WARNING bnet.c:852 could not connect to file daemon on ...:9102
> ERR=No route to host." When that happens, that particular job is given
> an "Error" completion status and all remaining jobs run as scheduled.)
>
> My hypothesis is that the difference in responses when trying to
> connect to one of these machines is due to whether or not the DHCP
> lease has expired. If it has not, the DNS record is still there, so
> gethostbyname succeeds but the attempt to connect to the file daemon
> fails. If the lease has expired, the DNS record is no longer there and
> gethostbyname fails.
>
> A second part of my hypothesis is that somewhere in the handling of
> the gethostbyname failure some resource needed for the catalog backup
> is not being released.
>
> Is this a problem that anyone else has encountered?
> Is someone already working on it?
> Any corrections to my guesses or other hints before I go digging into
> the sources?
>
> Thanks,
> Stuart Griffith
> Rodgers Instruments, LLC

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
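
To make the scenario guessed at the top of this message concrete, here is a minimal Python sketch of it. It is a toy model and not Bacula code: the `Device` class, the `wake_waiters_on_release` flag, and the job names are all hypothetical, and the real storage daemon may behave differently. It only shows how a reservation that is released without handing the device to a queued job would leave that job waiting forever.

```python
class Device:
    """Toy model of a storage device that one job at a time may reserve."""

    def __init__(self, wake_waiters_on_release):
        self.owner = None      # job currently holding the reservation
        self.waiting = []      # jobs queued behind the current owner
        # Hypothetical switch: does release() hand the device to a waiter?
        self.wake_waiters_on_release = wake_waiters_on_release

    def reserve(self, job):
        """Give the device to job if it is free, otherwise queue the job."""
        if self.owner is None:
            self.owner = job
            return True
        self.waiting.append(job)
        return False

    def release(self):
        """Drop the current reservation (e.g. when a timeout expires)."""
        self.owner = None
        if self.wake_waiters_on_release and self.waiting:
            self.owner = self.waiting.pop(0)


def simulate(wake_waiters_on_release):
    dev = Device(wake_waiters_on_release)
    # The device is reserved for the client backup while the storage
    # daemon waits for the (powered-off) client's file daemon.
    dev.reserve("client-backup")
    # The director's DNS lookup fails and the job ends on the director
    # side, but in this model the reservation is still held.
    dev.reserve("catalog-backup")   # the next job has to queue
    # The 30-minute timeout expires and the reservation is released.
    dev.release()
    return dev.owner, list(dev.waiting)


print(simulate(wake_waiters_on_release=False))  # suspected buggy behaviour
print(simulate(wake_waiters_on_release=True))   # behaviour one would expect
```

With the flag off, the device ends up free while the catalog backup is still queued, which matches the "waiting to reserve a device" symptom. With the flag on, the catalog backup gets the device as soon as the stale reservation is dropped.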