Hello, I've been thinking about possible causes of "spurious" connection drops and how to debug them.

As I have noted a number of times, the most likely culprit is a bad network, in particular switches or ethernet cards that have bad firmware (several such cases are documented in the manual).

Another cause of disconnects is HP printers, which illegally use port 9100. That by itself is OK, but under certain conditions they will sometimes broadcast on higher port numbers such as 9101, 9102, and 9103, which are registered to and used by Bacula. If you think it might be HP network printers (mine peacefully co-exists with Bacula), you can always move the HP port number, or move the Bacula ports, or turn off the printer(s) overnight for a few days while your backups run.

However, another possibility is that Bacula (say the FD or the SD) detects an internal error or a logic error in the data received on the comm line. In that case, it is very likely that it will abruptly hang up. In doing so, the daemon will always generate an error message (assuming there is no bug). The problem comes in delivering the message, because normally all messages from the FD and SD are delivered back to the Director and then dispatched according to the Director's message rules. It isn't always possible to deliver the message (timing problems, or the error concerned the connection with the Director itself), and in that case the message will be lost and a "spurious" hang up will be the only visible sign.

So, how to fix this? There are several ways, all of which involve changing the SD's and FD's Messages resources to direct the error messages to a file, via email, or to the system log, in addition to sending them to the Director. In cases where there are unexplained drops, I'd suggest you direct all messages to a file on both the FD and the SD.

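As a rough sketch of the system log variant mentioned above: the Messages resource accepts a syslog destination alongside the director one. The resource name Standard and the Director name rufus-dir are only illustrative assumptions here, so substitute the names from your own configuration.

```conf
Messages {
  Name = Standard
  # Illustrative names: replace Standard and rufus-dir with your own.
  director = rufus-dir = all, !skipped, !restored
  # Additionally copy the same message types to the system log.
  syslog = all, !skipped, !restored
}
```

The syslog route is convenient if you already collect system logs centrally; for a one-off investigation, a plain file in the daemon's working directory is simpler to inspect.
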
For example, a typical FD Messages resource looks like this:

  # Send all messages except skipped files back to Director
  Messages {
    Name = Standard
    director = rufus-dir = all, !skipped, !restored
  }

I'd suggest changing it to:

  Messages {
    Name = Standard
    director = rufus-dir = all, !skipped, !restored
    append = "<working-dir>/log" = all, !skipped, !restored
  }

where you change <working-dir> to be the path of the working directory used by the particular daemon.

Then, when a spurious connection drop occurs, perhaps there will be a message in the log explaining the reason for the drop.

If you implement the above, then to avoid filling your disk with log messages, be sure to remove the append line again sometime later, or implement the logrotate code that is distributed in <bacula-source>/scripts/logrotate.

Best regards,

Kern

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users