Hi,
On 2/12/2006 8:39 AM, Mike wrote:
Just to follow up with this (I work with Andreas)
> > Currently we are attempting to backup 47 servers per night with very
> > inconsistent results.
> >
> > It would seem that more than half of the backups (full and incrementals)
> > are failing due to a network timeout error (Network error on data channel.
> > ERR=Operation timed out).
>
> Some sort of intelligent (or semi-intelligent) switches in between? In
> such a case, you might try setting the heartbeat interval options
> wherever possible.
We actually had two different setups: a "front-end" group for untrusted
servers, which were either directly connected through a Riverstone switch
(managed, L3) or going out over the internet to remote locations, and a
"back-end" trusted group, which were all directly connected by a
Catalyst 2924.
We actually started off with a 1900-series Catalyst (10 meg) on the
back-end, but switched it out for a 2924 when we thought it might have
been causing the problems.
We've recently moved all backups to the front-end network, just to see if
it was an issue with running both at the same time. It made no difference.
So we have several different types of switches, with no clear pattern
pointing to any one of them as the culprit.
That's becoming an interesting issue... have you tried to observe the
connections in question from the switches' point of view? And yes, I know
that this sort of stuff is not really fun :-(
We've also been using the heartbeat interval on both the FDs and the SD
the whole time, set to 10-second intervals. Is that too long? Too short?
Usually, I'd consider that rather short. After all, the heartbeat
interval would best be set to a time just below the shortest timeout in
between, and I really hope that there isn't any equipment around that
times out IP connections after only 10 seconds...
But, apart from producing some unnecessary network load, I don't think
there's such a thing as a heartbeat interval that is too short.
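To put that rule of thumb in concrete terms, here is a minimal Python
sketch. The timeout values and the safety factor are made-up assumptions
for illustration, not measurements from this network, and
`suggest_heartbeat` is a hypothetical helper, not part of Bacula.

```python
def suggest_heartbeat(idle_timeouts_s, safety_factor=0.5):
    """Return a heartbeat interval (seconds) below the shortest idle timeout.

    idle_timeouts_s: idle timeouts, in seconds, of the firewalls, NAT boxes,
                     switches etc. on the path (hypothetical values; look
                     them up or measure them for your own equipment).
    safety_factor:   fraction of the shortest timeout to use; 0.5 is an
                     arbitrary, conservative assumption.
    """
    if not idle_timeouts_s:
        raise ValueError("need at least one timeout")
    if not 0 < safety_factor < 1:
        raise ValueError("safety_factor must be between 0 and 1")
    shortest = min(idle_timeouts_s)
    # Never go below one second; anything shorter only adds network load.
    return max(1, int(shortest * safety_factor))


# Hypothetical path: a firewall that drops idle TCP sessions after
# 5 minutes and a NAT gateway that does so after 1 hour.
print(suggest_heartbeat([300, 3600]))
```

With numbers like these, a 10 second interval is well on the safe side,
so the heartbeat setting itself is unlikely to be what causes the aborts.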
> Just to verify, we're using this on the FD and SD
--snip--
Heartbeat Interval = 10 seconds
--snip--
?
> > I am running out of ideas and may have to start investigating other backup
> > options should I not be able to resolve this. I like bacula and the way it
> > works so I would rather not move away from it if it can be helped.
>
> That's the spirit...
>
> Ok, what I would do:
> As a first step, only run one backup at a time. Second, observe the load on
> the systems while doing backups - it might happen that a network
> connection is closed because the process it belongs to doesn't answer
> because of high load (something I never observed under linux, but I
> *believe* that it can happen under windows). Also, network operations
> can become unreliable under high load conditions with bad (=cheap)
> network adapters in my experience. In such a case, tweaking the ip stack
> settings might help, but you'd better ask your local network or FreeBSD
> admin then...
A lot of the hardware is varied, but we do have a lot of new Dell machines
(1805's and 2850's). These all have *good* Intel gigabit NICs in them (em
driver under FreeBSD). We also have a lot of fairly recent IBM gear (again,
good NICs, with stable drivers under FreeBSD).
This should rule out the sort of problems I indicated.
The failures don't tend to gravitate towards the *questionable* hardware,
or to a specific OS. It seems that a machine will back up fine one night
and then fail the next.
Difficult, indeed.
> Also, observe the network and load on the backup server, and, if
> possible, see what happens at your switches (only possible with managed
> ones, obviously).
The backup machine itself is running on a new 1850- during a normal nightly
backup run (+/- 15 backups running), there is minimal network/system load-
much less than other systems handle on a regular basis.
Right, with recent hardware there shouldn't be any bottlenecks leading
to network connection aborts. One more possible reason ruled out...
I've also tested another application we use on this machine, which
generates large amounts of network traffic (~4 Mbits sustained, from
several different sources), and it seems to handle the load fine.
Bad.
> If a one-backup-a-time setup works, increase the number of simultaneous
> backups step by step, and keep your observations running.
>
> If you see the same sort of problems see if you can reproduce it - is it
> always happening when a large number of clients simultaneously send
> data, for example, which would indicate a network equipment failure.
>
> If the problem happens again, try limiting network bandwidth - set your
> NICs to 100MBit, for example, or use any sort of traffic shaping your OS
> or switches allow.
>
> The last item is what you should try if even with a one-job-a-time setup
> jobs are aborted.
We've tested doing one backup at a time- it seems that it can fail just as
easily with just one backup running, as it can with several.
That makes it more or less impossible for me to come up with anything I
consider useful - I guess it's time to use strace and / or gdb to get
more information.
For example, I just started a backup of one of our mail servers (on an
IBM x340), it was the only thing running and it ran for ~36 minutes, and
then died with the error:
--snip--
11-Feb 20:57 director: Start Backup JobId 6030, Job=mail-host.2006-02-11_20.57.39
11-Feb 20:57 storage: Spooling data ...
11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Fatal error: append.c:235 Network error on data channel. ERR=Operation timed out
Still I assume that this error - network timeout - should be caused by
one of the OSes or the network equipment.
11-Feb 22:32 mail-host: mail-host.2006-02-11_20.57.39 Fatal error: backup.c:498 Network send error to SD. ERR=Broken pipe
11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Error: bnet.c:257 Read error from client:x.x.x.x:36643: ERR=Operation timed out
11-Feb 22:32 mail-host: mail-host.2006-02-11_20.57.39 Error: bnet.c:425 Write error sending 32768 bytes to Storage daemon:y.y.y.y:9103: ERR=Broken pipe
11-Feb 22:33 director: mail-host.2006-02-11_20.57.39 Error: Bacula 1.38.2 (20Nov05): 11-Feb-2006 22:33:10
JobId: 6030
Job: mail-host.2006-02-11_20.57.39
Backup Level: Full
Client: "mail-host" i386-unknown-freebsd4.11,freebsd,4.11-PRERELEASE
FileSet: "mail-host" 2005-11-05 00:00:04
Pool: "corp_pool"
Storage: "Autochanger"
Scheduled time: 11-Feb-2006 20:57:35
Start time: 11-Feb-2006 20:57:41
End time: 11-Feb-2006 22:33:10
Priority: 10
FD Files Written: 301,065
SD Files Written: 301,065
FD Bytes Written: 6,063,875,529
SD Bytes Written: 6,100,519,734
Rate: 1058.5 KB/s
Software Compression: None
Volume name(s):
Volume Session Id: 4
Volume Session Time: 1139719983
Last Volume Bytes: 1,788,911,417
Non-fatal FD errors: 1
SD Errors: 0
FD termination status: Error
SD termination status: Error
Termination: *** Backup Error ***
--snip--
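As a quick sanity check on the report above, the elapsed time and the
average throughput can be recomputed from its own fields. This is only a
rough Python sketch; the values are copied from the report, and treating
1 KB as 1000 bytes is my assumption about how the Rate line is calculated.

```python
from datetime import datetime

# Values copied from the job report above.
start = datetime(2006, 2, 11, 20, 57, 41)   # Start time
end = datetime(2006, 2, 11, 22, 33, 10)     # End time
fd_bytes = 6_063_875_529                    # FD Bytes Written

elapsed_s = (end - start).total_seconds()

# Assumption: the report's "KB" means 1000 bytes, not 1024.
rate_kb_s = fd_bytes / elapsed_s / 1000
rate_mbit_s = fd_bytes * 8 / elapsed_s / 1_000_000

print(f"elapsed: {elapsed_s:.0f} s ({elapsed_s / 60:.1f} min)")
print(f"rate:    {rate_kb_s:.1f} KB/s, about {rate_mbit_s:.1f} Mbit/s")
```

By the report's timestamps the job ran for 5729 seconds, roughly 95
minutes, at about 1058.5 KB/s, which agrees with the Rate line. That is
around 8.5 Mbit/s on average, so the link was nowhere near saturated when
the connection timed out.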
We've also tried to vary our schedules by spreading the backup start
times out between 12am and 5am, as well as spreading out the
Full and Incremental backups to different days/times, with no change
to this error.
Really not so good.
> > We are running the backups using bacula 1.38.2 on a FreeBSD 5.4 server.
>
> The usual advice: Upgrade to 1.38.5 (shouldn't matter, but you never
> know...). As for OS / network tweaking to remain reliable in high load
> situations, you'd better ask someone who actually knows that OS,
> not me :-)
>
> Anyway, if you simply can't find a reason for your problem, it would
> seem best to run some serious network load tests - if the problem isn't
> with Bacula any other solution might suffer from the same problem.
We were actually running Bacula 1.36.3 before this, and upgraded to
1.38.2 in an attempt to resolve this problem ;)
Could this be a Bacula + FreeBSD issue? Are there many people out there
running the Bacula DIR and SD under FreeBSD? (we're using FreeBSD 5.4)
As far as I know there are quite a number of people running FreeBSD. You
might start another thread with a subject line including FreeBSD. Dan,
for example, seems to know FreeBSD quite well, but I doubt that he
followed this thread.
Could this be a spooling-related problem? About the only pattern I can see
is that it tends to happen while the job is spooling to disk.
That could also be caused by other problems, because spooling is the only
phase here in which serious network traffic is happening. Of course there
might also be some issue with simultaneous network and disk I/O.
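If you want to hunt for patterns more systematically, a small script that
tallies the fatal errors by job, hour and reason might help. The following
is a minimal Python sketch, assuming the messages look like the job output
quoted earlier and have been unwrapped to one message per line. The regex
and `tally_errors` are my own illustration, not a Bacula tool.

```python
import re
from collections import Counter

# Matches messages in the format of the job output quoted earlier, e.g.
# "11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Fatal error: ..."
LINE_RE = re.compile(
    r"^(?P<day>\d{2}-\w{3}) (?P<hour>\d{2}):(?P<minute>\d{2}) "
    r"(?P<daemon>\S+): "
    r"(?P<job>\S+?)\.\d{4}-\d{2}-\d{2}_\d{2}\.\d{2}\.\d{2} "
    r"(?P<level>Fatal error|Error): (?P<message>.*)$"
)


def tally_errors(lines):
    """Count fatal errors per job name, per hour of day and per ERR= reason."""
    by_job, by_hour, by_reason = Counter(), Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m["level"] != "Fatal error":
            continue  # skip informational lines and non-fatal errors
        by_job[m["job"]] += 1
        by_hour[m["hour"]] += 1
        # Text after the last "ERR=", or the whole message if there is none.
        by_reason[m["message"].rpartition("ERR=")[2]] += 1
    return by_job, by_hour, by_reason


# Sample lines taken from the job output quoted earlier (unwrapped).
sample = [
    "11-Feb 20:57 storage: Spooling data ...",
    "11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Fatal error: "
    "append.c:235 Network error on data channel. ERR=Operation timed out",
    "11-Feb 22:32 mail-host: mail-host.2006-02-11_20.57.39 Fatal error: "
    "backup.c:498 Network send error to SD. ERR=Broken pipe",
]

by_job, by_hour, by_reason = tally_errors(sample)
print(by_job.most_common())
print(by_hour.most_common())
print(by_reason.most_common())
```

Fed with a few weeks of messages, the per-hour counts in particular would
show whether the failures cluster around some other nightly activity.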
On a related note, I use Bacula (1.36.3) on a personal system, on a FreeBSD
5.4 machine, backing up to disk. It only has 6 machines to back up (three
local, and three remote) and significantly less data, but I've *never* had
any problems with it in this regard. In fact, it saved my butt the other
day when my development machine died and I had to restore :)
That sounds more like what I'm used to :-)
Any help would be appreciated,
Difficult... a very close examination of the running programs might help
more, but that will take lots of time and someone who really knows that
stuff. Or, although I'm not happy to suggest that, try another backup
solution. Most of the big ones should allow you an evaluation period,
although these usually don't really support FreeBSD as a backup server
operating system.
I still hope you get this solved...
Arno
Cheers,
Mike
--
IT-Service Lehmann [EMAIL PROTECTED]
Arno Lehmann http://www.its-lehmann.de