Just to follow up on this (I work with Andreas):
> > Currently we are attempting to backup 47 servers per night with very
> > inconsistent results.
> >
> > It would seem that more than half of the backups (full and incrementals) are
> > failing due to a network timeout error (Network error on data channel.
> > ERR=Operation timed out).
>
> Some sort of intelligent (or semi-intelligent) switches in between? In
> such a case, you might try setting the heartbeat interval options
> wherever possible.
We actually had two different setups: one "front-end" group for untrusted servers, which were either directly connected through a Riverstone switch (managed, L3) or going out over the internet to remote locations, and a "back-end" trusted group, which were all directly connected by a Catalyst 2924.
We actually started off with a 1900-series Catalyst (10 meg) on the back-end, but switched it out for a 2924 when we thought it might have been causing the problems.
We've recently moved all backups to the front-end network, just to see if it was an issue with running both at the same time - it made no difference.
So several different types of switches, with no clear pattern as to one being the culprit.
We've also been using the heartbeat interval on both the FDs and the SD the whole time, set up with 10-second intervals. Is that too long? Too short?
Just to verify, we're using this on the FD and SD:
--snip--
Heartbeat Interval = 10 seconds
--snip--
Does that look right?
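(For context, that directive sits in the FileDaemon resource of bacula-fd.conf on each client and in the Storage resource of bacula-sd.conf on the backup box - roughly like this, with the names changed and other required directives left out:)
--snip--
# bacula-fd.conf on a client - name is a placeholder
FileDaemon {
  Name = some-client-fd
  Heartbeat Interval = 10 seconds
}

# bacula-sd.conf on the backup server - name is a placeholder
Storage {
  Name = backup-sd
  Heartbeat Interval = 10 seconds
}
--snip--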
> > I am running out of ideas and may have to start investigating other backup
> > options should I not be able to resolve this. I like bacula and the way it
> > works so I would rather not move away from it if it can be helped.
>
> That's the spirit...
>
> Ok, what I would do:
> As a first step, only run one backup at a time. Second, observe the load on
> the systems while doing backups - it might happen that a network
> connection is closed because the process it belongs to doesn't answer
> because of high load (something I never observed under linux, but I
> *believe* that it can happen under windows). Also, network operations
> can become unreliable under high load conditions with bad (=cheap)
> network adapters in my experience. In such a case, tweaking the ip stack
> settings might help, but you'd better ask your local network or FreeBSD
> admin then...
A lot of the hardware is varied, but we do have a lot of new Dell machines (1805's and 2850's) - these all have *good* Intel GB NICs in them (em driver under FreeBSD) - as well as a lot of fairly recent IBM gear (again, good NICs, with stable drivers under FreeBSD).
The failures don't tend to gravitate towards the *questionable* hardware, or to a specific OS. It seems it'll work fine one night on a machine, and the next night it fails.
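(If it does come down to IP-stack tweaking as suggested, I'm assuming the relevant FreeBSD knobs would be the TCP keepalive sysctls - something like the following in /etc/sysctl.conf, though the values here are just guesses and we'd want to check them with someone who knows the OS better:)
--snip--
# send keepalives on all TCP connections, and probe well before
# the default 2-hour idle timeout (values are in milliseconds)
net.inet.tcp.always_keepalive=1
net.inet.tcp.keepidle=60000
net.inet.tcp.keepintvl=10000
--snip--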
> Also, observe the network and load on the backup server, and, if
> possible, see what happens at your switches (only possible with managed
> ones, obviously).
The backup machine itself is running on a new 1850 - during a normal nightly backup run (+/- 15 backups running), there is minimal network/system load - much less than other systems handle on a regular basis.
I've also tested another application we use on this machine, which generates large amounts of network traffic (~4 Mbit/s sustained, from several different sources), and it seems to handle the load fine.
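(For anyone who wants to reproduce a comparable load, I assume something like iperf from ports would do it - run the server side on the backup box and push a few parallel streams at it from other machines for a while:)
--snip--
# on the backup server
iperf -s

# on a handful of clients - 4 parallel streams for an hour each
# ("backup-server" is a placeholder hostname)
iperf -c backup-server -P 4 -t 3600
--snip--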
> If a one-backup-a-time setup works, increase the number of simultaneous
> backups step by step, and keep your observations running.
>
> If you see the same sort of problems see if you can reproduce it - is it
> always happening when a large number of clients simultaneously send
> data, for example, which would indicate a network equipment failure.
>
> If the problem happens again, try limiting network bandwidth - set your
> NICs to 100MBit, for example, or use any sort of traffic shaping your OS
> or switches allow.
>
> The last item is what you should try if even with a one-job-a-time setup
> jobs are aborted.
We've tested doing one backup at a time - it seems that it can fail just as easily with just one backup running as it can with several.
For example, I just started a backup of one of our mail servers (on an IBM x340); it was the only thing running, and it ran for ~36 minutes and then died with this error:
--snip--
11-Feb 20:57 director: Start Backup JobId 6030,
Job=mail-host.2006-02-11_20.57.39
11-Feb 20:57 storage: Spooling data ...
11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Fatal error: append.c:235 Network error on data channel. ERR=Operation timed out
11-Feb 22:32 mail-host: mail-host.2006-02-11_20.57.39 Fatal error: backup.c:498 Network send error to SD. ERR=Broken pipe
11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Error: bnet.c:257 Read error from client:x.x.x.x:36643: ERR=Operation timed out
11-Feb 22:32 mail-host: mail-host.2006-02-11_20.57.39 Error: bnet.c:425 Write error sending 32768 bytes to Storage daemon:y.y.y.y:9103: ERR=Broken pipe
11-Feb 22:33 director: mail-host.2006-02-11_20.57.39 Error: Bacula 1.38.2 (20Nov05): 11-Feb-2006 22:33:10
JobId: 6030
Job: mail-host.2006-02-11_20.57.39
Backup Level: Full
Client: "mail-host" i386-unknown-freebsd4.11,freebsd,4.11-PRERELEASE
FileSet: "mail-host" 2005-11-05 00:00:04
Pool: "corp_pool"
Storage: "Autochanger"
Scheduled time: 11-Feb-2006 20:57:35
Start time: 11-Feb-2006 20:57:41
End time: 11-Feb-2006 22:33:10
Priority: 10
FD Files Written: 301,065
SD Files Written: 301,065
FD Bytes Written: 6,063,875,529
SD Bytes Written: 6,100,519,734
Rate: 1058.5 KB/s
Software Compression: None
Volume name(s):
Volume Session Id: 4
Volume Session Time: 1139719983
Last Volume Bytes: 1,788,911,417
Non-fatal FD errors: 1
SD Errors: 0
FD termination status: Error
SD termination status: Error
Termination: *** Backup Error ***
--snip--
We've also tried to vary our schedules by spreading the backup start times out between 12am and 5am, as well as spreading out the Full and Incremental backups to different days/times - with no change to this error.
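(The spread-out schedules are just multiple Run lines in the Schedule resources - something along these lines, with hypothetical names and times:)
--snip--
Schedule {
  Name = "SpreadNightly"                  # hypothetical name
  Run = Level=Full 1st sat at 00:35       # monthly full, early
  Run = Level=Incremental sun-fri at 02:35
}
--snip--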
> > We are running the backups using bacula 1.38.2 on a FreeBSD 5.4 server.
>
> The usual advice: Upgrade to 1.38.5 (shouldn't matter, but you never
> know...) and considering OS / network tweaking to remain reliable in
> high load situations you better ask someone who actually knows that OS,
> not me :-)
>
> Anyway, if you simply can't find a reason for your problem, it would
> seem best to run some serious network load tests - if the problem isn't
> with Bacula any other solution might suffer from the same problem.
We were actually running Bacula 1.36.3 before this, and upgraded to 1.38.2 in an attempt to resolve this problem ;)
Could this be a Bacula + FreeBSD issue? Are there many people out there running the Bacula DIR and SD under FreeBSD? (We're using FreeBSD 5.4.)
Could this be a spooling-related problem? About the only pattern I can see is that it tends to happen while the job is spooling to disk.
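(In case it matters, spooling on our side is just the standard directives, roughly like the following - names, directory, and size are placeholders, and other required directives are left out:)
--snip--
# Director: Job resource
Job {
  Name = "mail-host-job"
  Spool Data = yes
}

# Storage daemon: Device resource
Device {
  Name = "Drive-0"
  Spool Directory = /var/spool/bacula
  Maximum Spool Size = 20gb
}
--snip--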
On a related note, I use Bacula (1.36.3) on a personal system, on a FreeBSD 5.4 machine, backing up to disk. It only has 6 machines to back up (three local and three remote) and significantly less data, but I've *never* had any problems with it in this regard - in fact, it saved my butt the other day when my development machine died and I had to restore :)
Any help would be appreciated,
Cheers,
Mike