Re: [Bacula-users] Inconsistent backup success rate

Arno Lehmann Tue, 14 Feb 2006 00:48:06 -0800

Hi,

On 2/12/2006 8:39 AM, Mike wrote:

Just to follow up with this (I work with Andreas)
 > > Currently we are attempting to back up 47 servers per night with very
 > > inconsistent results.
 > >
 > > It would seem that more than half of the backups (full and incrementals) are
 > > failing due to a network timeout error (Network error on data channel.
 > > ERR=Operation timed out).
 >
 > Some sort of intelligent (or semi-intelligent) switches in between? In
 > such a case, you might try setting the heartbeat interval options
 > wherever possible.
We actually had two different setups: one "front-end" group for untrusted
servers, which were either directly connected through a Riverstone switch
(managed, L3) or going out over the internet to remote locations, and a
"back-end" trusted group, which were all directly connected by a Catalyst 2924.

We actually started off with a 1900-series Catalyst (10 meg) on the back-end,
but switched it out with a 2924 when we thought it may have been causing the
problems. We've recently moved all backups to the front-end network, just to see
if it was an issue with running both at the same time. It made no difference.

So several different types of switches, with no clear pattern as to one being
the culprit.

That's becoming an interesting issue... have you tried to observe the connections in question from the switches' point of view? And yes, I know that this sort of stuff is not really fun :-(

We've also been using the heartbeat interval, on both the FDs and the SD, the
whole time, set up with 10 second intervals. Is that too long? Too short?

Usually, I'd consider that rather short. After all, the heartbeat interval would best be set to a time just below the shortest timeout in between, and I really hope that there isn't any equipment around that times out IP connections after only 10 seconds...

But, apart from producing some unnecessary network load, I don't think there's such a thing as a too short heartbeat interval.
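
The rule of thumb above (heartbeat just below the shortest timeout on the path) can be written down as a minimal sketch. The device names and timeout values below are invented for illustration, and the `pick_heartbeat_interval` helper and its safety margin are assumptions, not anything Bacula provides; the real idle timeouts have to come from the documentation or configuration of the equipment in between.

```python
def pick_heartbeat_interval(idle_timeouts_s, margin_s=5, floor_s=1):
    """Return a heartbeat interval in seconds, just below the shortest idle timeout.

    idle_timeouts_s: idle timeouts (seconds) of every device on the path.
    margin_s:        safety margin subtracted from the shortest timeout (assumption).
    floor_s:         never go below this interval.
    """
    timeouts = list(idle_timeouts_s)
    if not timeouts:
        raise ValueError("need at least one timeout")
    return max(floor_s, min(timeouts) - margin_s)


# Hypothetical idle timeouts for the equipment between FD and SD.
timeouts = {"switch": 300, "firewall": 60, "nat-gateway": 120}
interval = pick_heartbeat_interval(timeouts.values())

# Printed in the form of the directive quoted in this thread.
print(f"Heartbeat Interval = {interval} seconds")
```

With these made-up numbers the shortest timeout is 60 seconds, so the sketch suggests 55 seconds. A 10 second interval, as used here, stays below any of them as well.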


Just to verify, we're using this on the FD and SD

--snip--
Heartbeat Interval = 10 seconds
--snip--
?

 > > I am running out of ideas and may have to start investigating other backup
 > > options should I not be able to resolve this. I like bacula and the way it
 > > works so I would rather not move away from it if it can be helped.
 >
 > That's the spirit...
 >
 > Ok, what I would do:
 > As a first step, only run one backup at a time. Second, observe the load on
 > the systems while doing backups - it might happen that a network
 > connection is closed because the process it belongs to doesn't answer
 > because of high load (something I never observed under linux, but I
 > *believe* that it can happen under windows). Also, network operations
 > can become unreliable under high load conditions with bad (=cheap)
 > network adapters in my experience. In such a case, tweaking the ip stack
 > settings might help, but you'd better ask your local network or FreeBSD
 > admin then...
A lot of the hardware is varied, but we do have a lot of new Dell machines
(1805's and 2850's), which all have *good* Intel GB NICs in them (em driver
under FreeBSD), as well as a lot of fairly recent IBM gear (again, good NICs,
with stable drivers under FreeBSD).


This should rule out the sort of problems I indicated.

The failures don't tend to gravitate towards the *questionable* hardware, or
to a specific OS. It seems that it'll work one night fine on a machine, the
next night it fails.


Difficult, indeed.

 > Also, observe the network and load on the backup server, and, if
 > possible, see what happens at your switches (only possible with managed
 > ones, obviously).
The backup machine itself is running on a new 1850- during a normal nightly
backup run (+/- 15 backups running), there is minimal network/system load-
much less than other systems handle on a regular basis.

Right, with recent hardware there shouldn't be any bottlenecks leading to network connection aborts. One more possible reason ruled out...

I've also tested another application we use on this machine, which generates
large amounts of network traffic (~4 Mbits sustained, from several different
sources), and it seems to handle the load fine.


Bad.

 > If a one-backup-a-time setup works, increase the number of simultaneous
 > backups step by step, and keep your observations running.
 >
 > If you see the same sort of problems see if you can reproduce it - is it
 > always happening when a large number of clients simultaneously send
 > data, for example, which would indicate a network equipment failure.
 >
 > If the problem happens again, try limiting network bandwidth - set your
 > NICs to 100MBit, for example, or use any sort of traffic shaping your OS
 > or switches allow.
 >
 > The last item is what you should try if even with a one-job-a-time setup
 > jobs are aborted.

We've tested doing one backup at a time. It seems that it can fail just as
easily with just one backup running as it can with several.

That makes it more or less impossible for me to come up with anything I consider useful - I guess it's time to use strace and / or gdb to get more information.

For example, I just started a backup of one of our mail servers (on an
IBM x340), it was the only thing running and it ran for ~36 minutes, and
then died with the error:
--snip--
11-Feb 20:57 director: Start Backup JobId 6030, Job=mail-host.2006-02-11_20.57.39
11-Feb 20:57 storage: Spooling data ...
11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Fatal error: append.c:235 Network error on data channel. ERR=Operation timed out

Still I assume that this error - network timeout - should be caused by one of the OSes or the network equipment.

11-Feb 22:32 mail-host: mail-host.2006-02-11_20.57.39 Fatal error: backup.c:498 Network send error to SD. ERR=Broken pipe
11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Error: bnet.c:257 Read error from client:x.x.x.x:36643: ERR=Operation timed out
11-Feb 22:32 mail-host: mail-host.2006-02-11_20.57.39 Error: bnet.c:425 Write error sending 32768 bytes to Storage daemon:y.y.y.y:9103: ERR=Broken pipe
11-Feb 22:33 director: mail-host.2006-02-11_20.57.39 Error: Bacula 1.38.2 (20Nov05): 11-Feb-2006 22:33:10
  JobId:                  6030
  Job:                    mail-host.2006-02-11_20.57.39
  Backup Level:           Full
  Client:                 "mail-host" i386-unknown-freebsd4.11,freebsd,4.11-PRERELEASE
  FileSet:                "mail-host" 2005-11-05 00:00:04
  Pool:                   "corp_pool"
  Storage:                "Autochanger"
  Scheduled time:         11-Feb-2006 20:57:35
  Start time:             11-Feb-2006 20:57:41
  End time:               11-Feb-2006 22:33:10
  Priority:               10
  FD Files Written:       301,065
  SD Files Written:       301,065
  FD Bytes Written:       6,063,875,529
  SD Bytes Written:       6,100,519,734
  Rate:                   1058.5 KB/s
  Software Compression:   None
  Volume name(s):
  Volume Session Id:      4
  Volume Session Time:    1139719983
  Last Volume Bytes:      1,788,911,417
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***
--snip--

We've also tried to vary our schedules by spreading the backup start
times out between 12am and 5am, as well as spreading out the
Full and Incremental backups to different days/times, with no change
to this error.
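
To look for a pattern across many nights, job messages like the ones above can be tallied per job and per hour. The following is a minimal sketch, assuming the messages have been saved to text in the line format shown in this thread; the regular expressions are derived only from these sample lines and may need adjusting for other message types, and `tally_network_errors` is a hypothetical helper, not part of Bacula.

```python
import re
from collections import Counter

# "<DD-Mon> <HH:MM> <daemon>: <job> <Fatal error|Error>: <file.c:line> <message>"
LINE_RE = re.compile(
    r"^(?P<day>\d{2}-[A-Z][a-z]{2}) (?P<time>\d{2}:\d{2}) (?P<daemon>[\w.-]+): "
    r"(?P<job>\S+) (?P<level>Fatal error|Error):\s*(?P<src>[\w.]+:\d+)\s*(?P<msg>.*)$"
)
ERR_RE = re.compile(r"ERR=(?P<err>.+)$")
# Timestamp suffix that is appended to the job name, e.g. ".2006-02-11_20.57.39"
STAMP_RE = re.compile(r"\.\d{4}-\d{2}-\d{2}_\d{2}\.\d{2}\.\d{2}$")


def tally_network_errors(lines):
    """Count ERR= messages by (job name, error text) and by hour of day."""
    by_job = Counter()
    by_hour = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an error line in the expected format
        err = ERR_RE.search(m.group("msg"))
        if not err:
            continue  # error line without an ERR= part
        job_name = STAMP_RE.sub("", m.group("job"))
        by_job[(job_name, err.group("err"))] += 1
        by_hour[m.group("time")[:2]] += 1
    return by_job, by_hour


sample = [
    "11-Feb 20:57 storage: Spooling data ...",
    "11-Feb 22:32 storage: mail-host.2006-02-11_20.57.39 Fatal error: "
    "append.c:235 Network error on data channel. ERR=Operation timed out",
    "11-Feb 22:32 mail-host: mail-host.2006-02-11_20.57.39 Fatal error: "
    "backup.c:498 Network send error to SD. ERR=Broken pipe",
]
by_job, by_hour = tally_network_errors(sample)
for (job, err), count in sorted(by_job.items()):
    print(f"{job}: {err} x{count}")
print("errors per hour:", dict(by_hour))
```

Fed with several weeks of messages, the per-hour counts would show whether the timeouts cluster at particular times, which the scattered start times alone do not reveal.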


Really not so good.


 > > We are running the backups using bacula 1.38.2 on a FreeBSD 5.4 server.
 >
 > The usual advice: Upgrade to 1.38.5 (shouldn't matter, but you never
 > know...) and considering OS / network tweaking to remain reliable in
 > high load situations you better ask someone who actually knows that OS,
 > not me :-)
 >
 > Anyway, if you simply can't find a reason for your problem, it would
 > seem best to run some serious network load tests - if the problem isn't
 > with Bacula any other solution might suffer from the same problem.
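
A network load test that is independent of Bacula can be as simple as pushing a long TCP stream from a client to the backup server and checking that everything arrives. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the 32768-byte block size merely mirrors the write size in the error messages in this thread, and the demo runs over loopback. For a real test, the sink would run on the backup server, the sender on a client, and the transfer would last about as long as a failing job.

```python
import socket
import threading
import time


def run_sink(server_sock, result):
    """Accept one connection and count bytes until the peer closes it."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            total += len(chunk)
    result["bytes"] = total


def send_stream(host, port, total_bytes, block_size=32768, timeout_s=60.0):
    """Send total_bytes of filler data; return (bytes_sent, seconds_taken)."""
    payload = b"\0" * block_size
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout_s) as s:
        while sent < total_bytes:
            n = min(block_size, total_bytes - sent)
            s.sendall(payload[:n])
            sent += n
    return sent, time.monotonic() - start


def loopback_demo(total_bytes=4 * 1024 * 1024):
    """Run sink and sender on 127.0.0.1 to show the moving parts."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    sink = threading.Thread(target=run_sink, args=(server, result))
    sink.start()
    sent, secs = send_stream("127.0.0.1", port, total_bytes)
    sink.join()
    server.close()
    return sent, result["bytes"], secs


if __name__ == "__main__":
    sent, received, secs = loopback_demo()
    print(f"sent={sent} received={received} in {secs:.2f}s")
```

If a plain stream like this also dies with a timeout between the same two hosts, the problem is below Bacula; if it never does, that points back at the daemons or at the combination of network and disk I/O.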

We were actually running Bacula 1.36.3 before this, and upgraded to
1.38.2 in an attempt to resolve this problem ;)

Could this be a Bacula + FreeBSD issue? Are there many people out there
running the Bacula DIR and SD under FreeBSD? (we're using FreeBSD 5.4)

As far as I know there are quite a number of people running FreeBSD. You might start another thread with a subject line including FreeBSD - Dan, for example, seems to know FreeBSD quite well, but I doubt that he followed this thread.

Could this be a spooling related problem? About the only pattern I can see,
is that it tends to happen while the job is spooling to disk.

That could also be caused by other problems, because only during spooling is there serious network traffic happening. Of course there might also be some issue with simultaneous network and disk I/O.

On a related note, I use Bacula (1.36.3) on a personal system, on a FreeBSD
5.4 machine, backing up to disk. It only has 6 machines to back up (three
local, and three remote) and significantly less data, but I've *never* had
any problems with it in this regard. In fact, it saved my butt the other day
when my development machine died and I had to restore :)


That sounds more like what I'm used to :-)

Any help would be appreciated,

Difficult... a very close examination of the running programs might help more, but that will take lots of time and someone who really knows that stuff. Or, although I'm not happy to suggest that, try another backup solution. Most of the big ones should allow you an evaluation period, although these usually don't really support FreeBSD as a backup server operating system.


I still hope you get this solved...

Arno

Cheers,

Mike


--
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de

