Hi,

On 3/5/2007 8:35 PM, Alan Davis wrote:
> I was running a very large archival backup and about 20 hours into the
> backup I ran out of tapes that had the recycle flag set. I updated the
> flags and purged the first tape. The system then loaded the next tape
> and continued the backup.

Good, and expected, so far...

> The SD (or FD), however, never signaled the
> DIR that the job had resumed and it stayed in "waiting for appendable
> Volume" (JS_WaitMedia) for 518415 secs (6 days) and then the DIR killed
> the job with the messages:

Kern once stated that there was a hard-coded limit to the run or wait 
time which was about six days, IIRC. That would fit.
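
For what it's worth, the figure in your log is six days plus 15 seconds, 
which is what I'd expect from a six-day limit that is only checked 
periodically. A quick sketch to check the arithmetic (the helper name is 
mine, nothing from the Bacula source):

```python
def split_seconds(total):
    """Break a number of seconds into (days, hours, minutes, seconds)."""
    days, rest = divmod(total, 86400)    # 86400 seconds per day
    hours, rest = divmod(rest, 3600)     # 3600 seconds per hour
    minutes, seconds = divmod(rest, 60)
    return days, hours, minutes, seconds

# The value reported by the watchdog message
print(split_seconds(518415))  # (6, 0, 0, 15)
```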

> 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Error:
> Watchdog sending kill after 518415 secs to thread stalled reading File
> daemon.
> 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Fatal error:
> Network error with FD during Backup: ERR=Interrupted system call 
> 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Fatal error:
> No Job status returned from FD.

Looks like the FD did not send any more data.

> The SD, FD and DIR are all running on the same node so network problems
> between them did not cause the timeout.

An (almost) safe bet... these problems could occur on one single host, 
but that seems unlikely. I assume you checked that there was no 
firewalling taking place. The point is that you probably used the 
network IP address or the hostname of that machine (and not localhost), 
so it might be possible that Bacula's connection was trapped by a local 
firewall. Using localhost as a client / storage / dir address is 
possible, but would require a sophisticated setup if you also want to 
run backups from or to other hosts.
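
If you want to rule that out quickly, something like the following would 
show whether the daemons' ports accept connections via the machine's 
real hostname rather than localhost. It's only a sketch: I'm assuming 
the default ports (9101 DIR, 9102 FD, 9103 SD) and that 
socket.gethostname() gives the name used in your configuration, so 
adjust both as needed. A successful connect only shows that new 
connections are not blocked outright; it says nothing about whether a 
long-idle connection survives.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures
        return False

if __name__ == "__main__":
    host = socket.gethostname()  # assumption: same name as in the config
    # Assumed default Bacula ports; change them if yours differ
    for name, port in (("DIR", 9101), ("FD", 9102), ("SD", 9103)):
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{name} on {host}:{port} is {state}")
```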

> The wait status seems to come from the SD and is reported by the DIR,
> but the kill message from the DIR indicates that not being able to
> communicate with the FD was the reason it killed the job. 

Yes, as far as I know the SD waits for data from the FD, and that 
waiting took too long.

> I've looked at some of the code and the best candidate that I've found
> so far for where a problem might cause this is in
> filed/heartbeat.c:sd_heartbeat_thread or somewhere in the acquire/mount
> code that a message isn't being sent back to the DIR.
> 
> Due to the long runtime of the backup it's not practical for me to try
> to duplicate the problem exactly.

Understandable.

> I will try to create a reproducer with
> a smaller backup set once I have the archive backup completed.

It would probably be sufficient to start a job, make sure it's stalled 
(for example by not having volumes for the pool required) and then wait. 
I'd suggest waiting for at least two hours, as that would be the most 
common network timeout.

Then, allow the SD to continue.

I'd use a disk based pool for such tests, with volumes limited to a 
small size, and without allowing auto-creation of volumes.

> Any insight on the possible cause(s) would be greatly appreciated.

Which versions did you run?

Anyway, good luck tracking this down - these long-running jobs can 
really be a pain...

Arno


> 
> ----
> Alan Davis
> Senior Architect
> Ruckus Network, Inc.
> 703.464.6578 (o)
> 410.365.7175 (m)
> [EMAIL PROTECTED]
> alancdavis AIM
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Bacula-users mailing list
> Bacula-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bacula-users

-- 
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de
