On Thu, 25 October, 2007 9:53 pm, Arno Lehmann wrote:
> Hi,
>
> 25.10.2007 15:08, GDS.Marshall wrote:
>> On Wed, 24 October, 2007 8:17 pm, Arno Lehmann wrote:
>>> Hi,
>>>
>>> 24.10.2007 12:33, GDS.Marshall wrote:
>>>> Hello,
>>>>
>>>>> Hi,
>>>>>
>>>>> 22.10.2007 21:26, GDS.Marshall wrote:
>>>>>> version 2.2.4 patched from sourceforge
>>>>>> Linux kernel 2.6.x
>>>>>>
>>>>>> I am running 10+ FDs, one SD, and one Director. I am having
>>>>>> problems with one of my FDs; the others are fine.
> ...
>>>> FD+DIR    FD    FD
>>>>    |       |     |
>>>> GSW---------------....  Gig Switch
>>>>    |
>>>> FSW---------------....  Fast Switch
>>>>    |
>>>>  SD
>>> And the problem connection is between the hosts to the left... ok.
>> That is correct.
>>
>>> ...
>>>>>> 22-Oct 18:56 backupserver-sd: Spooling data ...
>>>>>> 22-Oct 18:56 fileserver-fd: fileserver-backup.2007-10-22_18.54.33
>>>>>> Fatal error: backup.c:892 Network send error to SD. ERR=Success
>>>>> So the connection breaks shortly after data starts being
>>>>> transferred, right?
>>>> Correct, 2193816 is always written.
>>> Funny. Disk full on the SD, perhaps? Might be worth a look into the
>>> system log on both machines.
>> No, that was one of the first things I checked. The SD spool is a
>> dedicated logical volume of 740 GB (more than two tapes of data). All
>> FDs write to the same spool. When the schedule runs the job, it is
>> not on its own; however, when I have been running it by hand, it is
>> the only job running.
>
> So we can be more or less sure it's got to do with the scheduling
> process.
>
> ...
>>> Good enough... regarding network problems, you could try to enable
>>> the heartbeat function in the FD and/or SD. To find the cause of the
>>> problem, tcpdump or wireshark might help.
>> I read about the heartbeat with the 3com issue, and switched it on
>> for both the FD and SD. I have not tried tcpdump or wireshark; will
>> give it a go.
>
> Use the filtering options extensively - otherwise, you will be
> overloaded by the output :-)
>
>>> If you see RST packets on the connection between FD and SD, it's
>>> only a question of who generates them...
>>>
>>> ...
>>>>> Here it's failed, I think. A higher debug level might reveal more,
>>>>> but this doesn't tell me anything important.
>>>> I am probably going to get flamed for this,
>>> Not by me :-)
>>>
>>>> but what value? Currently it is set to 200. I do not want to put it
>>>> too high and swamp the mailing list with data, but neither do I
>>>> want to waste the list's time by making it too low...
>>> Really a difficult question :-)
>>>
>>> The best approach might be to run with debug level 400, save the
>>> resulting logs, and only post the part around the failure first. If
>>> someone needs more detail, you could post the complete log to a web
>>> site.
>>
>> Okay, will give 400 a go.
>>
>>> ...
>>>>>> backupserver ~ #
>>>>> With the information from above, I suspect a network problem. Does
>>>>> the client's "Run Before" job run for a very long time? In such a
>>>>> situation, a firewall/router might close the connection between SD
>>>>> and FD because it seems to be idle.
>>>> The run-before job might take half an hour max. There is no
>>>> firewall or router in the setup.
>>> Hmm... half an hour should not trigger a RST due to idling too long.
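For the record, the heartbeat I switched on is just the following in
the FD's FileDaemon resource and the SD's Storage resource (60 seconds
is my own choice, not a recommended value):

    Heartbeat Interval = 60

And when I get to the packet capturing, I plan to run something like
this on the FD host (the interface name and SD hostname are guesses
for my setup; 9103 is just the default SD port):

    tcpdump -nn -i eth0 host backupserver and port 9103 \
        and 'tcp[tcpflags] & tcp-rst != 0'

which should show only RST packets on the FD-SD connection, per the
advice above to filter aggressively.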
>>> Do your other FDs on the network segment with the DIR have
>>> long-running scripts, too, or do they transfer data almost
>>> immediately after the backup jobs are started?
>> This is the only one with a script. Surely if it has started to
>> transfer data, the RST will not take place, as it is no longer idle
>> (just a thought).
>
> Well, it might happen that some device or software decides to drop
> that connection earlier, but only sends RST packets when the
> connection is (according to its assumptions) invalid. This would be a
> behaviour often found in routers, I believe.
>
> You could try to run that same job with a dummy "Client Run Before"
> script which immediately exits, just to see what happens then.
>
> If this case works, and the heartbeat doesn't, then it's surely time
> for some network debugging, I think.

I think I have found the problem. The equipment is on a gig network
and should be able to handle a bigger network buffer than 65536, so
the buffer had been set manually on this FD:

    Maximum Network Buffer Size = 16777216

This FD is the only one to have that set (the SD also has it set). If
I remove it from the FD, the job works fine. Why it stopped working
all of a sudden, I do not know.
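In case anyone hits the same thing, this is roughly what the relevant
part of my FD configuration looked like (the daemon name and the "..."
are placeholders from my setup, not defaults):

    FileDaemon {
      Name = fileserver-fd
      ...
      # Removing this line, which drops the FD back to the default
      # network buffer of 65536 bytes, is what made the job run again:
      Maximum Network Buffer Size = 16777216
    }

The SD still has its own Maximum Network Buffer Size set; only the
FD's copy had to go.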
Regards,

Spencer