Thank you. This will be useful for others seeking the same solution.

On Oct 22, 2012, at 7:47 PM, Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS INC] wrote:

> I changed the timeout value from 6 days to 60 days in src/lib/bnet.c and
> bsock.c. I also added the "Heartbeat Interval = 120" in bacula-dir.conf,
> bacula-sd.conf, bacula-fd.conf and bconsole.conf.
>
> bsock->timeout = 60 * 60 * 60 * 24; /* 60 days timeout */
>
> I re-compiled bacula and ran a full backup of 26TB. It completed successfully
> after 9 days.
>
> Thank you all for your help.
>
> Uthra
>
> -----Original Message-----
> From: Dan Langille [mailto:d...@langille.org]
> Sent: Thursday, October 11, 2012 9:49 AM
> To: Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS INC]
> Cc: Martin Simmons; bacula-users@lists.sourceforge.net
> Subject: Re: [Bacula-users] bacula watchdog sending kill
>
> On 2012-10-11 08:41, Martin Simmons wrote:
>>>>>>> On Wed, 10 Oct 2012 19:15:55 -0400, Dan Langille said:
>>>
>>> On Oct 10, 2012, at 5:51 PM, Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS
>>> INC] wrote:
>>>
>>>> I have bacula 5.2.10 installed on a RHEL 6 server and it has been
>>> running fine but recently we have bumped into a problem. I am
>>> backing up our data server which is about 26TB. I started a Full
>>> backup of this machine and the backup ran for 6 days and then the
>>> process is killed by Watchdog. Here is the information I got from the
>>> bconsole:
>>>>
>>>> 10-Oct 16:41 lindy-sd JobId 2458: User specified spool size
>>> reached.
>>>> 10-Oct 16:41 lindy-sd JobId 2458: Writing spooled data to Volume.
>>> Despooling 966,367,832,548 bytes ...
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Error: Watchdog sending kill
>>> after 518406 secs to thread stalled reading File daemon.
>>>
>>> Yes, that's 6 days (as mentioned below), or close to it: 518400...
>>>
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: Network error with
>>> FD during Backup: ERR=Interrupted system call
>>>> 10-Oct 16:43 lindy-sd JobId 2458: Fatal error: spool.c:301 Fatal
>>> append error on device "Drive-1" (/dev/nst0): ERR=
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: No Job status
>>> returned from FD.
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Error: Bacula lindy-dir 5.2.10
>>> (28Jun12):
>>>> Build OS: x86_64-unknown-linux-gnu redhat
>>> Enterprise release
>>>>
>>>>
>>>> I read about “Max Run Time = time” directive that could be set in
>>> the bacula config file. I also read that by default, the watchdog
>>> thread will kill any Job that has run more than 6 days. The maximum
>>> watchdog timeout is independent of MaxRunTime and cannot be changed??
>>>
>>> Yes, I am sure that is correct.
>>>
>>>> I am not sure if I should set this directive in my bacula config
>>> file? Has anybody encountered this issue? If so, how did you solve this
>>> problem?
>>>>
>>>> I would appreciate your help.
>>>>
>>>
>>> If I recall correctly, you need to make a code change, and recompile.
>>> It is a simple patch, and has been posted to this list (or at least
>>> referred to on this list) in the past month. Search for 'Watchdog
>>> sending kill' and see what you find.
>>>
>>> Oh wait, you're with NASA. OK, here goes. I like marc.info
>>> archives: http://marc.info/?l=bacula-users
>>>
>>> I found the reference I was thinking of:
>>> http://marc.info/?l=bacula-users&m=134237429312031&w=2
>>>
>>> I think this is what they were referring to:
>>> http://marc.info/?l=bacula-users&m=131707949318181&w=2
>>>
>>> and it looks like src/lib/watchdog.c is your friend. I looked at
>>> that code, but couldn't figure out a solution. And now I'm out of
>>> time. Sorry.
>>
>> This 6 days timeout is in src/lib/bnet.c I think (see init_bsock).
>
> Thank you.
>
> Found it.
>
> Look for this:
>
> /*
>  * ****FIXME**** reduce this to a few hours once
>  * heartbeats are implemented
>  */
> bsock->timeout = 60 * 60 * 6 * 24; /* 6 days timeout */
>
>
> Bump up the timeout value, recompile, and you're good to go.
>
>
> --
> Dan Langille - http://langille.org/

--
Dan Langille - http://langille.org

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
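
For anyone checking the arithmetic in the two bsock->timeout lines quoted above, here is a minimal sketch in C. It is not Bacula code: SECS_PER_DAY, days_to_secs() and secs_to_whole_days() are hypothetical names made up for illustration. It only confirms that 60 * 60 * 6 * 24 is 6 days in seconds, that 60 * 60 * 60 * 24 is 60 days, and that the 518406 secs in the watchdog message is 6 days plus 6 seconds.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Seconds in one day. Illustrative name, not taken from the Bacula source. */
#define SECS_PER_DAY (60 * 60 * 24)

/* The two expressions quoted in the thread, checked at compile time. */
static_assert(60 * 60 * 6 * 24 == 518400, "stock value is 6 days in seconds");
static_assert(60 * 60 * 60 * 24 == 5184000, "patched value is 60 days in seconds");

/* Hypothetical helper: convert a number of days to seconds. */
static inline int64_t days_to_secs(int64_t days)
{
    return days * SECS_PER_DAY;
}

/* Hypothetical helper: whole days contained in an elapsed-seconds value,
 * such as the figure printed in the "Watchdog sending kill" message. */
static inline int64_t secs_to_whole_days(int64_t secs)
{
    return secs / SECS_PER_DAY;
}
```

With these helpers, secs_to_whole_days(518406) gives 6, which matches the job being killed just past the 6-day mark, and days_to_secs(9) is 777600, comfortably under the 60-day value that let the 9-day backup finish.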