Re: [Bacula-users] Watchdog timer killed long-running backup

Kern Sibbald Tue, 06 Mar 2007 01:00:04 -0800

On Monday 05 March 2007 23:57, Alan Davis wrote:
> I understand the sanity check - but the job wasn't idle - the FD and SD
> were both working and data was being written to tapes as expected for 6
> days.
>
> Would the director not know that the job was running and just assume
> that no job could take longer than the hard-coded timeout?


I don't know the answer to that question -- I suggest you look at the code.  
It should be looking at the socket use counts, but perhaps it does not.
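
To make the distinction concrete, here is a minimal Python sketch (not Bacula code, which is C/C++) contrasting a watchdog that fires on total elapsed time with one that fires only on socket inactivity. Every name in it (`Conn`, `should_kill_elapsed`, `should_kill_idle`, `SIX_DAYS`) is hypothetical, and the 6-day limit is only inferred from the 518415-second figure in the log, not read from the source. Whether the Director's connection to the FD actually saw any traffic during the run is not established in this thread.

```python
from dataclasses import dataclass

# Assumed limit, inferred from the log message; not taken from Bacula source.
SIX_DAYS = 6 * 24 * 60 * 60


@dataclass
class Conn:
    """Hypothetical stand-in for a daemon connection watched by a timer."""
    timer_start: float    # when the blocking read began (seconds)
    last_activity: float  # when bytes last moved on the socket (seconds)


def should_kill_elapsed(conn: Conn, now: float, limit: float = SIX_DAYS) -> bool:
    # Fires on total time since the read started, even if data is still flowing.
    return now - conn.timer_start > limit


def should_kill_idle(conn: Conn, now: float, limit: float = SIX_DAYS) -> bool:
    # Fires only if nothing has moved on the socket for `limit` seconds.
    return now - conn.last_activity > limit


# A read that began 518415 s ago on a socket that last saw traffic 30 s ago.
now = 518415.0
conn = Conn(timer_start=0.0, last_activity=now - 30.0)
print(should_kill_elapsed(conn, now))  # True
print(should_kill_idle(conn, now))     # False
```

Under the first policy a long but healthy job is killed once the limit passes. Under the second it survives as long as something, such as a periodic heartbeat, keeps touching the socket.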

My personal opinion is that any job that runs 6 days is totally insane.  You 
have about 0.000001% chance of ever being able to restore from it, and/or use 
it as a basis for additional Incremental/Differential backups.  Also the data 
on that backup (IMO) is not valid unless the machine was idle for those 6 
days.

IMO, you need to re-think how you are doing backups.  If that doesn't appeal 
to you, you can always increase the timeout, but again IMO, you are just 
heading for trouble later.
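
For reference when choosing a larger value, the figure in the log can be checked with a few lines of Python. The sketch assumes that the timestamp embedded in the job name (2007-02-26_17.16.43) is when the job started. That is an assumption about the naming convention, not something confirmed in this thread.

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

elapsed = 518415  # seconds, from the Director's watchdog message
days, remainder = divmod(elapsed, SECONDS_PER_DAY)
print(days, remainder)  # 6 15

# Assumed job start, taken from the job name in the log.
start = datetime(2007, 2, 26, 17, 16, 43)
kill = start + timedelta(seconds=elapsed)
print(kill)  # 2007-03-04 17:16:58
```

The result is 6 days and 15 seconds, landing within a minute of the 04-Mar 17:17 log entries. That is consistent with a fixed 6-day limit counted from the start of the job rather than from the tape change about 20 hours in, but this is an inference from the numbers only.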

>
> The message seemed to indicate that the director was trying to talk to
> the FD but couldn't, or was expecting a response to the mount that it
> never got.
>
>
> ----
> Alan Davis
> Senior Architect
> Ruckus Network, Inc.
> 703.464.6578 (o)
> 410.365.7175 (m)
> [EMAIL PROTECTED]
> alancdavis AIM
>
> > -----Original Message-----
> > From: Kern Sibbald [mailto:[EMAIL PROTECTED]
> > Sent: Monday, March 05, 2007 2:55 PM
> > To: bacula-users@lists.sourceforge.net
> > Cc: Alan Davis
> > Subject: Re: [Bacula-users] Watchdog timer killed long-running backup
> >
> > This is not a bug, but rather an insanity check.  If you want to have idle
> > jobs remain in the system longer, take a look at src/lib/watchdog.c --
> > someplace in that file there should be a tag that sets the timeout, which
> > you can make longer as you wish.
> >
> > On Monday 05 March 2007 20:35, Alan Davis wrote:
> > > I was running a very large archival backup and about 20 hours into the
> > > backup I ran out of tapes that had the recycle flag set. I updated the
> > > flags and purged the first tape. The system then loaded the next tape
> > > and continued the backup. The SD (or FD), however, never signaled the
> > > DIR that the job had resumed and it stayed in "waiting for appendable
> > > Volume" (JS_WaitMedia) for 518415 secs (6 days) and then the DIR killed
> > > the job with the messages:
> > >
> > > 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Error:
> > > Watchdog sending kill after 518415 secs to thread stalled reading File
> > > daemon.
> > > 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Fatal error:
> > > Network error with FD during Backup: ERR=Interrupted system call
> > > 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Fatal error:
> > > No Job status returned from FD.
> > >
> > > The SD, FD and DIR are all running on the same node so network problems
> > > between them did not cause the timeout.
> > >
> > > The wait status seems to come from the SD and is reported by the DIR,
> > > but the kill message from the DIR indicates that not being able to
> > > communicate with the FD was the reason it killed the job.
> > >
> > > I've looked at some of the code and the best candidate that I've found
> > > so far for where a problem might cause this is in
> > > filed/heartbeat.c:sd_heartbeat_thread or somewhere in the acquire/mount
> > > code that a message isn't being sent back to the DIR.
> > >
> > > Due to the long runtime of the backup it's not practical for me to try
> > > to duplicate the problem exactly. I will try to create a reproducer with
> > > a smaller backup set once I have the archive backup completed.
> > >
> > > Any insight on the possible cause(s) would be greatly appreciated.
> > >
> > > ----
> > > Alan Davis
> > > Senior Architect
> > > Ruckus Network, Inc.
> > > 703.464.6578 (o)
> > > 410.365.7175 (m)
> > > [EMAIL PROTECTED]
> > > alancdavis AIM

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
