Just to follow up on this (I work with Andreas):
 
 
> > Currently we are attempting to backup 47 servers per night with very
> > inconsistent results.
> >
> > It would seem that more than half of the backups (full and incrementals) are
> > failing due to a network timeout error (Network error on data channel.
> > ERR=Operation timed out).
>
> Some sort of intelligent (or semi-intelligent) switches in between? In
> such a case, you might try setting the heartbeat interval options
> wherever possible.
 
We actually had two different setups: a "front-end" group for untrusted servers,
which were either directly connected through a Riverstone switch (managed, L3)
or reached over the internet at remote locations, and a "back-end" trusted
group, which were all directly connected through a Catalyst 2924.
 
We actually started off with a Catalyst 1900 series switch (10 Mbit) on the
back-end, but swapped it for a 2924 when we thought it might be causing the
problems.
 
We've recently moved all backups to the front-end network, just to see if it was an
issue with running both at the same time. It made no difference.
 
So several different types of switches, with no clear pattern as to one being the
culprit.

We've also been using the heartbeat interval on both the FDs and the SD the whole
time, set to 10 seconds. Is that too long? Too short?
 
Just to verify, this is what we have on both the FD and the SD. Is that right?
 
--snip--
 
Heartbeat Interval = 10 seconds
 
--snip--
 
> > I am running out of ideas and may have to start investigating other backup
> > options should I not be able to resolve this. I like bacula and the way it
> > works so I would rather not move away from it if it can be helped.
>
> That's the spirit...
>
> Ok, what I would do:
> As a first step, only run one backup at a time. Second, observe the load on
> the systems while doing backups - it might happen that a network
> connection is closed because the process it belongs to doesn't answer
> because of high load (something I never observed under linux, but I
> *believe* that it can happen under windows). Also, network operations
> can become unreliable under high load conditions with bad (=cheap)
> network adapters in my experience. In such a case, tweaking the ip stack
> settings might help, but you'd better ask your local network or FreeBSD
> admin then...
 
The hardware is varied, but we do have a lot of new Dell machines
(1850s and 2850s). These all have *good* Intel gigabit NICs in them (em driver
under FreeBSD). We also have a lot of fairly recent IBM gear (again, good NICs,
with stable drivers under FreeBSD).
 
The failures don't tend to gravitate towards the *questionable* hardware, or
to a specific OS. A backup will run fine on a machine one night and then fail
on the same machine the next night.
 
 
> Also, observe the network and load on the backup server, and, if
> possible, see what happens at your switches (only possible with managed
> ones, obviously).
 
 
The backup machine itself is running on a new 1850. During a normal nightly
backup run (roughly 15 backups running), there is minimal network/system load,
much less than other systems handle on a regular basis.
 
I've also tested another application we use on this machine, which generates
large amounts of network traffic (~4 Mbits sustained, from several different
sources), and it seems to handle the load fine.
 
 
> If a one-backup-a-time setup works, increase the number of simultaneous
> backups step by step, and keep your observations running.
>
> If you see the same sort of problems see if you can reproduce it - is it
> always happening when a large number of clients simultaneously send
> data, for example, which would indicate a network equipment failure.
>
> If the problem happens again, try limiting network bandwidth - set your
> NICs to 100MBit, for example, or use any sort of traffic shaping your OS
> or switches allow.
>
> The last item is what you should try if even with a one-job-a-time setup
> jobs are aborted.
 
We've tested doing one backup at a time. It can fail just as easily with
just one backup running as it can with several.
 
For example, I just started a backup of one of our mail servers (on an
IBM x340). It was the only thing running; it ran for ~36 minutes and
then died with the error:
 
--snip--
 
11-Feb 20:57 director: Start Backup JobId 6030, Job=mail-host.2006-02-11_20.57.39
11-Feb 20:57 storage: Spooling data ...
11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Fatal error: append.c:235 Network error on data channel. ERR=Operation timed out
11-Feb 22:32 mail-host: mail-host.2006-02-11_20.57.39 Fatal error: backup.c:498 Network send error to SD. ERR=Broken pipe
11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Error: bnet.c:257 Read error from client:x.x.x.x:36643: ERR=Operation timed out
11-Feb 22:32 mail-host: mail-host.2006-02-11_20.57.39 Error: bnet.c:425 Write error sending 32768 bytes to Storage daemon:y.y.y.y:9103: ERR=Broken pipe
11-Feb 22:33 director: mail-host.2006-02-11_20.57.39 Error: Bacula 1.38.2 (20Nov05): 11-Feb-2006 22:33:10
  JobId:                  6030
  Job:                    mail-host.2006-02-11_20.57.39
  Backup Level:           Full
  Client:                 "mail-host" i386-unknown-freebsd4.11,freebsd,4.11-PRERELEASE
  FileSet:                "mail-host" 2005-11-05 00:00:04
  Pool:                   "corp_pool"
  Storage:                "Autochanger"
  Scheduled time:         11-Feb-2006 20:57:35
  Start time:             11-Feb-2006 20:57:41
  End time:               11-Feb-2006 22:33:10
  Priority:               10
  FD Files Written:       301,065
  SD Files Written:       301,065
  FD Bytes Written:       6,063,875,529
  SD Bytes Written:       6,100,519,734
  Rate:                   1058.5 KB/s
  Software Compression:   None
  Volume name(s):        
  Volume Session Id:      4
  Volume Session Time:    1139719983
  Last Volume Bytes:      1,788,911,417
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***
 
--snip--
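 
As a sanity check on the report above, the elapsed time and the Rate line can
be recomputed from the Start time, End time and FD Bytes Written fields. This
is only a rough Python sketch: it assumes the field layout shown in this report,
and the helper name summarize is made up for illustration.

```python
from datetime import datetime

# Three fields copied from the job report above.
REPORT = """\
  Start time:             11-Feb-2006 20:57:41
  End time:               11-Feb-2006 22:33:10
  FD Bytes Written:       6,063,875,529
"""

def summarize(report):
    """Return (elapsed_seconds, kilobytes_per_second) for a job report."""
    fields = {}
    for line in report.splitlines():
        # Split on the first colon only, so the time of day stays intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # %b assumes an English/C locale for the month abbreviation.
    fmt = "%d-%b-%Y %H:%M:%S"
    start = datetime.strptime(fields["Start time"], fmt)
    end = datetime.strptime(fields["End time"], fmt)
    elapsed = (end - start).total_seconds()
    nbytes = int(fields["FD Bytes Written"].replace(",", ""))
    # Assumption: the report's "KB" means 1000 bytes.
    return elapsed, nbytes / elapsed / 1000.0

if __name__ == "__main__":
    elapsed, rate = summarize(REPORT)
    print(f"elapsed: {elapsed:.0f} s, rate: {rate:.1f} KB/s")
```

For this report that works out to 5729 seconds (about 95 minutes) and roughly
1058 KB/s, which agrees with the Rate line.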
 
We've also tried to vary our schedules by spreading the backup start
times out between 12am and 5am, as well as spreading out the
Full and Incremental backups to different days/times, with no change
to this error.
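 
To look for a pattern across nights, the "Network error on data channel" lines
could be tallied per job and per hour of day. A minimal Python sketch follows. It
assumes log lines shaped like the ones in the job log above, and the names
LINE_RE and tally_timeouts are invented for this example.

```python
import re
from collections import Counter

# Matches lines shaped like the SD "Fatal error ... Network error on data
# channel" line in the job log above. Other log formats are not handled.
LINE_RE = re.compile(
    r"^(?P<day>\d{2}-\w{3}) (?P<hour>\d{2}):\d{2} \S+: "
    r"(?P<job>.+?)\.\d{4}-\d{2}-\d{2}_\d{2}\.\d{2}\.\d{2} "
    r"Fatal error: .*Network error on data channel"
)

def tally_timeouts(lines):
    """Count data-channel failures by job name and by hour of day."""
    by_job = Counter()
    by_hour = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            by_job[m.group("job")] += 1
            by_hour[m.group("hour")] += 1
    return by_job, by_hour

if __name__ == "__main__":
    sample = [
        "11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Fatal error: "
        "append.c:235 Network error on data channel. ERR=Operation timed out",
    ]
    jobs, hours = tally_timeouts(sample)
    print(dict(jobs), dict(hours))
```

Only the SD's append.c line is counted, so each failed job shows up once even
though the FD logs its own "Broken pipe" errors for the same failure.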
 

> > We are running the backups using bacula 1.38.2 on a FreeBSD 5.4 server.
>
> The usual advice: Upgrade to 1.38.5 (shouldn't matter, but you never
> know...) and considering OS / network tweaking to remain reliable in
> high load situations you better ask someone who actually knows that OS,
> not me :-)
>
> Anyway, if you simply can't find a reason for your problem, it would
> seem best to run some serious network load tests - if the problem isn't
> with Bacula any other solution might suffer from the same problem.
 
We were actually running Bacula 1.36.3 before this, and upgraded to
1.38.2 in an attempt to resolve this problem ;)
 
Could this be a Bacula + FreeBSD issue? Are there many people out there
running the Bacula DIR and SD under FreeBSD? (we're using FreeBSD 5.4)
 
Could this be a spooling-related problem? About the only pattern I can see
is that it tends to happen while the job is spooling to disk.
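 
One way to separate a network problem from a Bacula or spooling problem might
be to hold a long-running TCP transfer between the same two hosts, outside of
Bacula, and see whether it also times out. Below is a rough Python sketch of
that idea, not a proper load test. The function names and the timeout value are
made up, and the 32768-byte chunk size is simply borrowed from the bnet.c write
error in the job log.

```python
import socket
import threading
import time

CHUNK = 32768  # same write size that appears in the bnet.c error message

def receive(server_sock, timeout, result):
    """Accept one connection and count bytes until EOF or a read timeout."""
    conn, _ = server_sock.accept()
    conn.settimeout(timeout)
    total = 0
    try:
        while True:
            data = conn.recv(CHUNK)
            if not data:
                result["status"] = "eof"
                break
            total += len(data)
    except socket.timeout:
        result["status"] = "timed out"
    finally:
        conn.close()
        result["bytes"] = total

def send(host, port, chunks, pause=0.0):
    """Stream `chunks` blocks of CHUNK bytes, optionally pausing between them."""
    payload = b"\0" * CHUNK
    with socket.create_connection((host, port)) as sock:
        for _ in range(chunks):
            sock.sendall(payload)
            time.sleep(pause)

def run_local_test(chunks=64, timeout=5.0):
    """Loopback smoke test; on real hosts run receive() and send() separately."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    worker = threading.Thread(target=receive, args=(server, timeout, result))
    worker.start()
    send("127.0.0.1", port, chunks)
    worker.join()
    server.close()
    return result

if __name__ == "__main__":
    print(run_local_test())
```

On the real network, receive() would run on the backup server and send() on a
client, with enough chunks to keep the connection busy for an hour or more. If
that transfer also dies with a timeout, the problem is probably not specific to
Bacula.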
 
On a related note, I use Bacula (1.36.3) on a personal FreeBSD 5.4 machine,
backing up to disk. It only has 6 machines to back up (three local and three
remote) and significantly less data, but I've *never* had any problems with it
in this regard. In fact, it saved my butt the other day when my development
machine died and I had to restore :)
 
Any help would be appreciated,

Cheers,
 
Mike
