This is not a bug, but rather a sanity check.  If you want idle jobs to 
remain in the system longer, take a look at src/lib/watchdog.c -- 
somewhere in that file there should be a setting for the timeout, which you 
can make as long as you wish.
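
As a rough illustration of the kind of check involved (this is only a sketch, 
not the actual code in watchdog.c; the names IDLE_TIMEOUT_SECS and 
idle_too_long are hypothetical, and the six-day value is an assumption 
inferred from the 518415-second figure in the log quoted below):

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical limit: six days expressed in seconds (518400).
 * Raising this value would let idle jobs survive longer. */
#define IDLE_TIMEOUT_SECS (6 * 24 * 60 * 60)

/* Return true when the time since the last activity exceeds the limit.
 * A periodic watchdog pass would call this for each waiting job. */
static bool idle_too_long(time_t last_activity, time_t now, time_t limit)
{
    return difftime(now, last_activity) > (double)limit;
}
```

With a limit of 518400 seconds, a job idle for 518415 seconds would be 
flagged on the first periodic check after the limit passes, which would be 
consistent with the kill message you saw.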

On Monday 05 March 2007 20:35, Alan Davis wrote:
> I was running a very large archival backup and about 20 hours into the
> backup I ran out of tapes that had the recycle flag set. I updated the
> flags and purged the first tape. The system then loaded the next tape
> and continued the backup. The SD (or FD), however, never signaled the
> DIR that the job had resumed and it stayed in "waiting for appendable
> Volume" (JS_WaitMedia) for 518415 secs (6 days) and then the DIR killed
> the job with the messages:
>
> 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Error:
> Watchdog sending kill after 518415 secs to thread stalled reading File
> daemon.
> 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Fatal error:
> Network error with FD during Backup: ERR=Interrupted system call
> 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Fatal error:
> No Job status returned from FD.
>
> The SD, FD and DIR are all running on the same node so network problems
> between them did not cause the timeout.
>
> The wait status seems to come from the SD and is reported by the DIR,
> but the kill message from the DIR indicates that not being able to
> communicate with the FD was the reason it killed the job.
>
> I've looked at some of the code and the best candidate that I've found
> so far for where a problem might cause this is in
> filed/heartbeat.c:sd_heartbeat_thread or somewhere in the acquire/mount
> code that a message isn't being sent back to the DIR.
>
> Due to the long runtime of the backup it's not practical for me to try
> to duplicate the problem exactly. I will try to create a reproducer with
> a smaller backup set once I have the archive backup completed.
>
> Any insight on the possible cause(s) would be greatly appreciated.
>
>
> ----
> Alan Davis
> Senior Architect
> Ruckus Network, Inc.
> 703.464.6578 (o)
> 410.365.7175 (m)
> [EMAIL PROTECTED]
> alancdavis AIM
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share
> your opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Bacula-users mailing list
> Bacula-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bacula-users
