On Monday 05 March 2007 23:57, Alan Davis wrote:
> I understand the sanity check - but the job wasn't idle - the FD and SD
> were both working and data was being written to tapes as expected for 6
> days.
>
> Would the director not know that the job was running and just assume
> that no job could take longer than the hard-coded timeout?

I don't know the answer to that question -- I suggest you look at the code.
It should be looking at the socket use counts, but perhaps it does not.

My personal opinion is that any job that runs 6 days is totally insane. You
have about 0.000001% chance of ever being able to restore from it, and/or
use it as a basis for additional Incremental/Differential backups. Also the
data on that backup (IMO) is not valid unless the machine was idle for those
6 days.

IMO, you need to re-think how you are doing backups. If that doesn't appeal
to you, you can always increase the timeout, but again IMO, you are just
heading for trouble later.

> The message seemed to indicate that the director was trying to talk to
> the FD but couldn't, or was expecting a response to the mount that it
> never got.
>
> ----
> Alan Davis
> Senior Architect
> Ruckus Network, Inc.
> 703.464.6578 (o)
> 410.365.7175 (m)
> [EMAIL PROTECTED]
> alancdavis AIM
>
> > -----Original Message-----
> > From: Kern Sibbald [mailto:[EMAIL PROTECTED]
> > Sent: Monday, March 05, 2007 2:55 PM
> > To: bacula-users@lists.sourceforge.net
> > Cc: Alan Davis
> > Subject: Re: [Bacula-users] Watchdog timer killed long-running backup
> >
> > This is not a bug, but rather an insanity check. If you want to have idle
> > jobs remain in the system longer, take a look at src/lib/watchdog.c --
> > someplace in that file there should be a tag that sets the timeout, which
> > you can make longer as you wish.
> >
> > On Monday 05 March 2007 20:35, Alan Davis wrote:
> > > I was running a very large archival backup and about 20 hours into the
> > > backup I ran out of tapes that had the recycle flag set. I updated the
> > > flags and purged the first tape. The system then loaded the next tape
> > > and continued the backup.
> > > The SD (or FD), however, never signaled the
> > > DIR that the job had resumed and it stayed in "waiting for appendable
> > > Volume" (JS_WaitMedia) for 518415 secs (6 days) and then the DIR killed
> > > the job with the messages:
> > >
> > > 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Error: Watchdog sending kill after 518415 secs to thread stalled reading File daemon.
> > > 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Fatal error: Network error with FD during Backup: ERR=Interrupted system call
> > > 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Fatal error: No Job status returned from FD.
> > >
> > > The SD, FD and DIR are all running on the same node so network problems
> > > between them did not cause the timeout.
> > >
> > > The wait status seems to come from the SD and is reported by the DIR,
> > > but the kill message from the DIR indicates that not being able to
> > > communicate with the FD was the reason it killed the job.
> > >
> > > I've looked at some of the code and the best candidate that I've found
> > > so far for where a problem might cause this is in
> > > filed/heartbeat.c:sd_heartbeat_thread or somewhere in the acquire/mount
> > > code that a message isn't being sent back to the DIR.
> > >
> > > Due to the long runtime of the backup it's not practical for me to try
> > > to duplicate the problem exactly. I will try to create a reproducer with
> > > a smaller backup set once I have the archive backup completed.
> > >
> > > Any insight on the possible cause(s) would be greatly appreciated.
> > >
> > > ----
> > > Alan Davis
> > > Senior Architect
> > > Ruckus Network, Inc.
> > > 703.464.6578 (o)
> > > 410.365.7175 (m)
> > > [EMAIL PROTECTED]
> > > alancdavis AIM
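
For readers following the thread, the behaviour described can be sketched roughly as below. This is only an illustration, not the code in src/lib/watchdog.c: the names watchdog_expired and format_elapsed are made up, and the 6-day timeout is an assumption inferred from the log message (518415 secs is 6 days plus 15 seconds). The sketch shows why a check that only measures how long one read has been blocked would fire even though data was still moving between the FD and SD.

```python
SECS_PER_DAY = 86_400

# Assumed for illustration only: the thread never states the real constant.
# 518415 secs in the log is 6 days + 15 secs, which suggests a 6-day limit.
WATCHDOG_TIMEOUT_SECS = 6 * SECS_PER_DAY


def watchdog_expired(blocked_since, now, timeout=WATCHDOG_TIMEOUT_SECS):
    """Return True if a read has been blocked for more than `timeout` secs.

    `blocked_since` is the time (in seconds) at which the thread started
    waiting on the socket, or None if it is not currently waiting.
    Note that nothing here looks at activity on other connections.
    """
    if blocked_since is None:
        return False
    return now - blocked_since > timeout


def format_elapsed(secs):
    """Render a number of seconds as 'Nd HH:MM:SS'."""
    days, rem = divmod(secs, SECS_PER_DAY)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}d {hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    elapsed = 518_415  # value reported in the watchdog log message
    print(format_elapsed(elapsed))                    # 6d 00:00:15
    print(watchdog_expired(blocked_since=0, now=elapsed))  # True
```

Under these assumptions, the DIR thread that was waiting to read from the File daemon would be killed once its own wait passed the limit, regardless of how busy the FD and SD were with each other.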