Hi,

25.10.2007 15:08, GDS.Marshall wrote:
> On Wed, 24 October, 2007 8:17 pm, Arno Lehmann wrote:
>> Hi,
>>
>> 24.10.2007 12:33, GDS.Marshall wrote:
>>> Hello,
>>>
>>>> Hi,
>>>>
>>>> 22.10.2007 21:26, GDS.Marshall wrote:
>>>>> version 2.2.4 patched from sourceforge
>>>>> Linux kernel 2.6.x
>>>>>
>>>>> I am running 10+ FDs, one SD, and one Director. I am having problems
>>>>> with one of my FDs, the others are fine.
...
>>> FD+DIR      FD      FD
>>>    |        |       |
>>> GSW---------------....   Gig Switch
>>>    |
>>> FSW---------------....   Fast Switch
>>>    |
>>>  SD
>> And the problem connection is between the hosts to the left... ok.
> That is correct.
>
>> ...
>>>>> 22-Oct 18:56 backupserver-sd: Spooling data ...
>>>>> 22-Oct 18:56 fileserver-fd: fileserver-backup.2007-10-22_18.54.33 Fatal
>>>>> error: backup.c:892 Network send error to SD. ERR=Success
>>>> So the connection breaks shortly after data starts being transferred,
>>>> right?
>>> Correct, 2193816 is always written.
>> Funny. Disk full on the SD, perhaps? Might be worth a look into the
>> system logs on both machines.
> No, that was one of the first things I checked. The SD spool is a
> dedicated logical volume of 740 GB (over two tapes of data). All FDs
> write to the same spool. When the schedule runs the job, it is not on
> its own; however, when I have been running it by hand, it is the only
> job running.
So we can be more or less sure it's got to do with the scheduling
process.

...

>> Good enough... regarding network problems, you could try to enable the
>> heartbeat function in the FD and / or SD. To find the cause of the
>> problem, tcpdump or wireshark might help.
> I read about heartbeat with the 3com issue, and switched it on for both
> the FD and SD. I have not tried tcpdump or wireshark, will give it a go.

Use the filtering options extensively - otherwise, you will be
overwhelmed by the output :-)

>> If you see RST packets on the connection between FD and SD, it's only
>> a question of who generates them...
>>
>> ...
>>>> Here it's failed, I think. A higher debug level might reveal more,
>>>> but this doesn't tell me anything important.
>>> I am probably going to get flamed for this,
>> Not by me :-)
>>
>>> but what value? Currently it is set to 200. I do not want to put it
>>> too high and swamp the mailing list with data, but neither do I want
>>> to waste the mailing list's time by making it too low...
>> Really a difficult question :-)
>>
>> The best approach might be to run with debug level 400, save the
>> resulting logs, and only post the part around the failure first. If
>> someone needs more detail, you could post the complete log to a web
>> site.
>
> Okay, will give 400 a go.
>
>> ...
>>>>> backupserver ~ #
>>>> With the information from above, I suspect a network problem. Does
>>>> the Client Run Before Job script you use run for a very long time?
>>>> In such a situation, a firewall/router might close the connection
>>>> between SD and FD because it seems to be idle.
>>> The run before job might take half an hour max. There is no firewall
>>> or router in the setup.
>> Hmm... half an hour should not trigger an RST due to idling too long.
>> Do your other FDs on the network segment with the DIR have
>> long-running scripts, too, or do they transfer data almost immediately
>> after the backup jobs are started?
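For reference, the heartbeat is switched on with the "Heartbeat Interval"
directive, in the FileDaemon resource on the FD side and the Storage
resource on the SD side. A sketch, using the daemon names from your log
output (the 60-second interval is just an assumption - pick whatever
suits your network):

```conf
# bacula-fd.conf, FileDaemon resource
FileDaemon {
  Name = fileserver-fd
  # ... other settings ...
  Heartbeat Interval = 60
}

# bacula-sd.conf, Storage resource
Storage {
  Name = backupserver-sd
  # ... other settings ...
  Heartbeat Interval = 60
}
```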
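For the tcpdump filtering, something along these lines should keep the
noise down. The interface name is an assumption, and 9103 is the
standard Bacula SD port - adjust both to your setup:

```shell
# Capture only the FD<->SD traffic on the SD port; -n skips DNS lookups.
# Replace eth0 and the host name with your actual interface and SD host.
tcpdump -n -i eth0 host backupserver and tcp port 9103
```

Watch for the direction of any RST packets in the output - that tells
you which side (or which box in between) is killing the connection.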
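By the way, you can raise the debug level from bconsole without
restarting the daemons; a sketch, again using the names from your log
(commands as documented in the Bacula manual):

```
setdebug level=400 client=fileserver-fd
setdebug level=400 storage=backupserver-sd
```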
> This is the only one with a script. Surely if it has started to
> transfer data, the RST will not take place, as it is no longer idle
> (just a thought).

Well, it might happen that some device or software decides to drop that
connection earlier, but only sends RST packets when the connection is
(according to its assumptions) invalid. This would be behaviour often
found in routers, I believe.

You could try to run that same job with a dummy "Client Run Before"
script which immediately exits, just to see what happens then. If that
case works, and the heartbeat doesn't help, then it's surely time for
some network debugging, I think.

Arno

--
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
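P.S.: A minimal dummy "Client Run Before" script for that test could
look like this (the path and file name are just examples):

```shell
#!/bin/sh
# Dummy "Client Run Before" script for testing: does nothing and
# returns success immediately, so the backup job proceeds straight
# to transferring data.
exit 0
```

In the Job resource you would then point the "Client Run Before Job"
directive at this file instead of the real script for one test run.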