On Thu, 25 October, 2007 9:53 pm, Arno Lehmann wrote:
> Hi,
>
> 25.10.2007 15:08, GDS.Marshall wrote:
>> On Wed, 24 October, 2007 8:17 pm, Arno Lehmann wrote:
>>> Hi,
>>>
>>> 24.10.2007 12:33, GDS.Marshall wrote:
>>>> Hello,
>>>>
>>>>> Hi,
>>>>>
>>>>> 22.10.2007 21:26, GDS.Marshall wrote:
>>>>>> version 2.2.4 patched from sourceforge
>>>>>> Linux kernel 2.6.x
>>>>>>
>>>>>> I am running 10+ FDs, one SD, and one Director.  I am having
>>>>>> problems with one of my FDs; the others are fine.
> ...
>>>> FD+DIR   FD   FD
>>>>   |      |     |
>>>>  GSW---------------.... Gig Switch
>>>>   |
>>>>  FSW---------------.... Fast Switch
>>>>   |
>>>>   SD
>>> And the problem connection is between the hosts to the left... ok.
>> That is correct.
>>
>>> ...
>>>>>> 22-Oct 18:56 backupserver-sd: Spooling data ...
>>>>>> 22-Oct 18:56 fileserver-fd: fileserver-backup.2007-10-22_18.54.33 Fatal
>>>>>> error: backup.c:892 Network send error to SD. ERR=Success
>>>>> So the connection breaks shortly after data starts being transferred,
>>>>> right?
>>>> Correct, 2193816 is always written.
>>> Funny. Disk full on the SD, perhaps? Might be worth a look into the
>>> system log on both the machines.
>> No, that was one of the first things I checked.  The SD spool is a
>> dedicated logical volume of 740 GB (over two tapes of data).  All FDs
>> write to the same spool.  When the schedule runs the job, it is not on
>> its own; however, when I have been running it by hand, it is the only
>> job running.
>
> So we can be more or less sure it's got to do with the scheduling process.
>
> ...
>>> Good enough... regarding network problems, you could try to enable the
>>> heartbeat function in the FD and / or SD. To find the cause of the
>>> problem, tcpdump or wireshark might help.
>> I read about heartbeat with the 3com issue and switched it on for both
>> the FD and SD.  I have not tried tcpdump or wireshark; I will give it a
>> go.
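
For reference, this is roughly what I added to enable the heartbeat,
assuming the resource names from my configs (the 60-second interval is
just my own choice):

  # bacula-fd.conf -- FileDaemon resource on the problem client
  FileDaemon {
    Name = fileserver-fd
    # ... other directives unchanged ...
    Heartbeat Interval = 60   # send a heartbeat to the SD every 60 seconds
  }

  # bacula-sd.conf -- Storage resource
  Storage {
    Name = backupserver-sd
    # ... other directives unchanged ...
    Heartbeat Interval = 60   # keep otherwise idle FD/SD connections alive
  }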
>
> Use the filtering options extensively - otherwise, you will be
> overloaded by the output :-)
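
Something along these lines is what I plan to run on the SD side, assuming
the default SD port of 9103 and eth0 as the capture interface (the host
name is just a guess at my FD's box name):

  tcpdump -n -i eth0 host fileserver and port 9103 and 'tcp[tcpflags] & tcp-rst != 0'

That should only show RST packets on the FD-to-SD connection, so the
output stays manageable.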
>
>>> If you see RST packets on the connection between FD and SD, the only
>>> question is who generates them...
>>>
>>> ...
>>>>> Here it's failed, I think. A higher debug level might reveal more, but
>>>>> this doesn't tell me anything important.
>>>> I am probably going to get flamed for this,
>>> Not by me :-)
>>>
>>>> but what value should I use?  Currently it is set to 200.  I do not want
>>>> to put it too high and swamp the mailing list with data, but neither do I
>>>> want to waste the list's time by making it too low....
>>> Really a difficult question :-)
>>>
>>> The best approach might be to run with debug level 400, save the
>>> resulting logs, and only post the part around the failure first. If
>>> someone needs more detail, you could post the complete log to a web
>>> site.
>>
>> Okay, will give 400 a go.
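
In case it is useful to anyone else following the thread, I intend to raise
the level from bconsole rather than restarting the daemons, roughly like
this (the client and storage names here are assumptions based on my own
configs; the storage name must match the Director's Storage resource):

  * setdebug level=400 client=fileserver-fd
  * setdebug level=400 storage=backupserver-sd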
>>
>>> ...
>>>>>> backupserver ~ #
>>>>> With the information from above, I suspect a network problem. Does the
>>>>> Client Run Before Job script you use run for a very long time? In such a
>>>>> situation, a firewall/router might close the connection between SD and
>>>>> FD because it seems to be idle.
>>>> The run before job might take half an hour max.  There is no firewall
>>>> or router in the setup.
>>> Hmm... half an hour should not trigger an RST due to idling too long.
>>> Do your other FDs on the network segment with the DIR have
>>> long-running scripts, too, or do they transfer data almost immediately
>>> after the backup jobs are started?
>> This is the only one with a script.  Surely if it has started to
>> transfer data, the RST will not take place as it is no longer idle
>> (just a thought).
>
> Well, it might happen that some device or software decides to drop
> that connection earlier, but only sends RST packets when the
> connection is (according to its assumptions) invalid. This would be a
> behaviour often found in routers, I believe.
>
> You could try to run that same job with a dummy "Client Run Before"
> script which immediately exits, just to see what happens then.
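
That is easy enough to try; for the archives, a throw-away change to the
Job resource along these lines should do it (the job name is the one from
my setup, and /bin/true stands in for a script that exits immediately):

  # bacula-dir.conf -- Job resource for the problem client
  Job {
    Name = "fileserver-backup"
    # ... existing directives unchanged ...
    Client Run Before Job = "/bin/true"   # dummy: exits at once, data transfer starts right away
  }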
>
> If this case works, and the heartbeat doesn't, then it's surely time
> for some network debugging, I think.
I think I have found the problem.  The equipment is on a gigabit network
and should be able to handle a network buffer larger than 65536, so the
size had been set manually:

Maximum Network Buffer Size = 16777216

This FD is the only one to have that directive set; the SD also has it
set.  If I remove it from the FD, the job works fine.  Why it stopped
working all of a sudden, I do not know.
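
For anyone hitting the same thing: on the client the directive lives in
the FileDaemon resource of bacula-fd.conf, and commenting it out there
(i.e. falling back to the default buffer size) is what made the job run
again.  A sketch of the change, with the resource name as in my config:

  # bacula-fd.conf on the problem client
  FileDaemon {
    Name = fileserver-fd
    # ... other directives unchanged ...
    # Maximum Network Buffer Size = 16777216   # removed -- default works
  }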

Regards,

Spencer



