On Tuesday 06 March 2012 14:44:33 Silver Salonen wrote:
> On Tuesday 28 February 2012 16:07:50 Christopher Hylarides wrote:
> > I'm not sure why (I haven't had the need to dig this deep), but with
> > large backups (well, all of them, really) bacula-dir connects to the FD,
> > then the FD starts doing stuff while the DIR still maintains the
> > connection. So it could be timing out after half an hour and then later
> > when the DIR tries to write again it fails.
> >
> > This is why I tuned the TCP *keepalive* to 15 seconds from the Solaris
> > default of 2 hours. This is exactly what happened to me. I'd start a
> > large backup, and without question it failed at 2.5 hours.
> >
> > See also:
> > http://leaf.dragonflybsd.org/mailarchive/commits/2008-03/msg00166.html
> >
> > What you probably want to do is forcefully enable TCP keepalives and
> > have them go every minute or so. It may not even be your firewall
> > timing out. My machines were on the same LAN.
>
> I set TCP keepalive to 15 seconds on my Bacula server, but it did not change
> a thing.
>
> Additionally I downgraded Bacula server to 5.0, but fortunately it seems it
> did not help either (meaning the problem is not a regression in 5.2).
>
> I was able to solve some problems though.
> We have multiple clients in the same environment, but in different VLANs, all
> behind a pfSense firewall. Previously the DIR connected to clients through
> external addresses and "port reflection" (whatever that means in pfSense).
> When I changed the external addresses to internal ones, the DIR--FD timeouts
> went away.
>
> So I guess the remaining FD--SD timeouts are somehow caused by the pfSense
> firewall too. I'll keep digging.
>
> PS. Please post your replies below the quoted text in mailing lists :)
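
For reference, the keepalive tuning discussed in the quoted text was done at
the kernel level (system-wide). The same idea can also be applied per socket.
Below is a minimal sketch in Python, only as an illustration: the function
name enable_keepalive and its default values are hypothetical, and the
TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT options are platform-specific, so
the sketch skips any that the local platform does not expose. This is not how
Bacula itself configures its sockets.

```python
import socket


def enable_keepalive(sock, idle=15, interval=15, count=4):
    """Enable TCP keepalive on one socket (hypothetical helper).

    idle     -- seconds of silence before the first probe is sent
    interval -- seconds between unanswered probes
    count    -- unanswered probes before the kernel drops the connection
    """
    # SO_KEEPALIVE itself is widely available.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The timer options are not defined on every platform, so only
    # set the ones the local socket module actually provides.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Keepalive probes are answered by the kernel of whatever host terminates the
TCP connection, so a userspace proxy in the path may not count them as
traffic. Bacula also has a "Heartbeat Interval" directive that sends
application-level heartbeats; whether it helps in this particular setup was
not tested in this thread.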

So I've confirmed that what is to blame here is pfSense's port reflection.
From forums I've found that supposedly it means that the port redirection is
done with netcat instead of PF (which is really hackish, even to pfSense's
developers' minds) and netcat's TCP timeout is 2000s by default. And it does
not seem to be possible to disable the timeout.

What is still not clear to me is why the DIR has to keep the DIR--FD
connection open while the FD is sending its data to the SD. But well, at
least the issue is worked around now.

-- 
Silver

> > On 12-02-27 10:23 AM, Silver Salonen wrote:
> > > On Monday 27 February 2012 09:29:13 Christopher Hylarides wrote:
> > >> I had a similar issue that was solved by tweaking my TCP-keepalives at
> > >> the kernel level that my director was on (in my case Solaris).
> > >>
> > >> My case was on a LAN, but with over 300GB. It would fail at exactly the
> > >> same time.
> > >
> > > Hi.
> > >
> > > Thanks for the information. We use FreeBSD-based PF firewalls and all the
> > > timeout values are on default in there and none of them is less than 15s:
> > >
> > > tcp.first         120s
> > > tcp.opening        30s
> > > tcp.established 86400s
> > > tcp.closing       900s
> > > tcp.finwait        45s
> > > tcp.closed         90s
> > > tcp.tsdiff         30s
> > >
> > > Any more guesses? May it be some hardware-related stuff?
> > >
> > > --
> > > Silver
> > >
> > >>
> > >> On 12-02-25 9:21 AM, Silver Salonen wrote:
> > >>> On Thu, 23 Feb 2012 10:49:55 -0500, Josh Fisher wrote:
> > >>>> On 2/23/2012 4:11 AM, Silver Salonen wrote:
> > >>>>> On Wednesday 22 February 2012 15:20:10 Silver Salonen wrote:
> > >>>>>
> > >>>>> What's also interesting about these failures are these lines
> > >>>>> (similar in all these failing jobs):
> > >>>>>   FD Files Written:       381
> > >>>>>   SD Files Written:       0
> > >>>>>   FD Bytes Written:       391,430,239 (391.4 MB)
> > >>>>>   SD Bytes Written:       0 (0 B)
> > >>>>>   Last Volume Bytes:      260 (260 B)
> > >>>>>
> > >>>>> And the actual volume file seems to contain all the data (its size
> > >>>>> is 373MB).
> > >>>>>
> > >>>>> What can we conclude from that?
> > >>>>> Does the failure/timeout/whatever occur after the FD--SD connection,
> > >>>>> eg. when SD tries to communicate with DIR about the end of the job or
> > >>>>> smth?
> > >>>>
> > >>>> Or does the Dir abort the job after a timeout/whatever occurs for the
> > >>>> Dir->FD connection? Since the problem started after changing network
> > >>>> environment, I suspect a switch or router is timing out the Dir->FD
> > >>>> connection, perhaps when the FD is busy compressing a large file or
> > >>>> something. Try turning compression off? Just a guess.
> > >>>
> > >>> Tried it. No changes :(
> > >>>
> > >>> --
> > >>> Silver

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users