Hello,

I've been thinking about possible causes of "spurious" connection drops and 
how to debug them.

As I have noted a number of times, the most likely culprit is a bad network 
(in particular switches or ethernet cards that have bad firmware -- several 
such cases are documented in the manual).  Another cause of disconnects is 
HP printers, which illegally use port 9100.  That in itself is OK, but under 
certain conditions they will also broadcast on higher port numbers such as 
9101, 9102, and 9103, which are registered to and used by Bacula.  

If you think it might be HP network printers (mine peacefully co-exists with 
Bacula), you can always move the HP port number, or move the Bacula ports, or 
turn off the printer(s) overnight for a few days while your backups run.

However, another possibility is that Bacula (say the FD or the SD) detects an 
internal error or a logic error in the data received on the comm line. In 
that case, it is very likely it will abruptly hang up.  

In doing so, the daemon will always generate an error message (assuming there 
is no bug).  The problem comes in delivering the message because normally all 
messages from the FD and SD are delivered back to the Director and then 
dispatched according to the Director's message rules.  The problem is that it 
isn't always possible to deliver the message (timing problems, or the error 
concerned the connection with the Director), and in that case, the message 
will be lost and a "spurious" hang up will be the only visible sign.

So, how to fix this?  There are several ways, all of which involve changing 
the SD's and FD's Messages resources to direct the error messages to a file, 
to email, or to the system log in addition to sending them to the Director.

I'd suggest in cases where there are unexplained drops, you direct all 
messages to a file on both the FD and the SD.  For example, a typical FD 
Messages resource looks like this:

# Send all messages except skipped and restored files back to Director
Messages {
  Name = Standard
  director = rufus-dir = all, !skipped, !restored
}

I'd suggest changing it to:

Messages {
  Name = Standard
  director = rufus-dir = all, !skipped, !restored
  append = "<working-dir>/log" = all, !skipped, !restored
}

where you change <working-dir> to be the path of the working directory used by 
the particular daemon. 
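
If a file is inconvenient, the system-log route mentioned earlier could look 
roughly like the sketch below.  It assumes the same Director name (rufus-dir) 
as in the example above.  As far as the manual describes it, the syslog 
destination takes no file name and the messages go to the daemon facility, so 
check your syslog configuration to see where they actually end up.

Messages {
  Name = Standard
  director = rufus-dir = all, !skipped, !restored
  # Assumption: also log locally via syslog, so the message survives
  # even when it cannot be delivered to the Director
  syslog = all, !skipped, !restored
}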

Then when a spurious connection drop occurs, perhaps there will be a message 
in the log explaining the reason for the drop.  If you implement the above, 
to avoid filling your disk with log messages, be sure to remove the append 
line again sometime later or implement the logrotate code that is distributed 
in <bacula-source>/scripts/logrotate.

Best regards,

Kern
