On Fri, 2007-09-28 at 08:46 +0200, Arno Lehmann wrote:
> Hello,
> 
> 27.09.2007 22:47,, Ross Boylan wrote::
> > On Thu, 2007-09-27 at 09:19 +0200, Arno Lehmann wrote:
> >> Hi,
> >>
> >> 27.09.2007 01:17, Ross Boylan wrote:
> >>> I've been having really slow backups (13 hours) when I backup a large
> >>> mail spool.  I've attached a run report.  There are about 1.4M files
> >>> with a compressed size of 4G.  I get much better throughput (e.g.,
> >>> 2,000KB/s vs 86KB/s for this job!) with other jobs.
> >> 2MB/s is still not especially fast for a backup to disk, I think. So 
> >> your storage disk might also be a factor here.
> >>
> >>> First, does it sound as if something is wrong?  I suspect the number of
> >>> files is the key thing, and the mail spool has lots of little files
> >>> (it's used by Cyrus).  Is this just life when you have lots of little
> >>> files?
> >>>
> >>> Second, how can I figure out what the problem is?  I do have some
> >>> suspicions, but first some basics:
> >>> ------------------------------------------------
> >>> everything is running on the same box
> >>> 3GHz P4 with one SATA drive as the main drive and 4 older drives, one of
> >>> which is the backup target.
> >>> No noticeable CPU load or disk activity during the backup.  I was
> >>> compressing, but that doesn't show up noticeably for CPU use.
> >> How much memory, and how is the memory usage during backups?
> > 2G of RAM.  I'll have to watch it to determine how much is in use.
> 
> 2 GB sounds ok to me, but you might find that tuning the database 
> helps a bit.
> 
> ...
> >>> I am not using snapshotting because that feature is broken right now
> >>> (nothing to do with bacula).  I shut down the cyrus server during the
> >>> backup (despite some errors in the log around my attempted shutdown, it
> >>> seemed to have worked).
> >>>
> >>> My suspicion is that the TCP/IP transactions are all getting delayed
> >>> (maybe to batch for sending) in a way that usually isn't noticeable, but
> >>> is noticeable when doing lots of quick exchanges locally.
> >> I don't know anything about issues with TCP delays, and I know of
> >> Bacula installations running smoothly on all sorts of hardware and
> >> different OSes.
> >>
> >> I rather suspect the catalog to be the bottleneck.
> >>
> >> Verifying this might be as easy as running vmstat while the job is 
> >> backed up and seeing if there is lots of iowait happening - this does 
> >> not necessarily show as hard disk activity.
> > Would tcp induced delays also show up as iowait?
> 
> I'm not sure, because I still don't know what sort of TCP delays this 
> would be. Iowait would probably show up if the network driver has to 
> wait for the network adapter to process operations.
> 
> You could try to use some network benchmark to see if there are 
> throughput problems.
> 
> >> Are your database and the mail spool on the same disk? This might 
> >> explain the slowness you encounter.
> > Yes.
> 
> Hmm... this can be a major problem.
> 
> >> In this case, I'd suggest upgrading to Bacula 2.2.4, for two reasons:
> >> there is a serious bug that will hit you one day, and it is fixed in
> >> the current version. Second, the new batch-insert feature would gain
> >> you a lot of speed if database throughput really is the bottleneck
> >> for you.
> > I see 2.2.4 is in Debian unstable, so I should be able to pull it in.
> > That would be great if it speeds things up.
> 
> Please let us know if this upgrade alone has good results.
> 
> ...
> >>> ######## Cyrus
> >>> ## really this needs more care: use snapshot, dump db to ascii
> >> As far as I know, it's sufficient to dump cyrus' database. Given that 
> >> dump and a backup of your mail files, a correct cyrus database can be 
> >> easily regenerated. Snapshots would be a good thing, perhaps, but 
> >> you'd still have to explicitly dump the database as there is no 
> >> guarantee that the disk files of the database are always in a 
> >> consistent state.
> > cyrus recommends the ascii dump to guard against version changes that
> > would render the binary unusable.
> 
> True, but if you restore to the same version of cyrus (actually, 
> that's the database version they use) this would not be the main 
> problem. Restoring to a different OS/distribution version should 
> definitely be done with the ascii dump.
> 
> > http://cyrusimap.web.cmu.edu/twiki/bin/view/Cyrus/Backup has more.
> > You're right: snapshots alone will not assure integrity.
> > .....
> >> I'm really unsure about TCP problems, but the situation more or less 
> >> looks like the catalog backend would be your problem. Could you try to 
> >> have the catalog db on another machine?
> > I've only got the one for now.
> 
> vmstat during a backup would be a good next step in this case, I think.
> 

Here are the results of a test job.  The first vmstat sample was taken
shortly after I started the job:
# vmstat 15
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  2   7460  50760 204964 667288    0    0    43    32  197   15 18  5 75  2
 1  1   6852  51476 195492 675524   28    0  1790   358  549 1876 20  6 36 38
 0  2   6852  51484 189332 682612    0    0  1048   416  470 1321 12  4 41 43
 2  0   6852  52508 187344 685328    0    0   303   353  485 1369 16  4 68 12
 1  0   6852  52108 187352 685464    0    0     1   144  468 1987 12  4 84  0

This clearly shows about 40% of the CPU time spent in IO wait during the
backup, with another 40% idle.  I'm not sure whether the numbers are
skewed by the fact that I have 2 virtual CPUs (not really: it's a P4
with hyperthreading).  If that's the case, the 40% might really mean 80%.
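For what it's worth, here is one way to boil a saved vmstat run down to a
single iowait figure.  This is just a sketch: the `vmstat.log` file name is
my own, and the data below is the first run pasted back in so the awk line
has something to chew on.  The wa column is field 16 in this output, but
vmstat layouts vary between versions, so check your own header first.

```shell
# During the backup, save samples with something like:
#   vmstat 15 > vmstat.log        (stop with Ctrl-C when the job ends)
# Here the first run's numbers stand in for a live vmstat.log.
cat > vmstat.log <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  2   7460  50760 204964 667288    0    0    43    32  197   15 18  5 75  2
 1  1   6852  51476 195492 675524   28    0  1790   358  549 1876 20  6 36 38
 0  2   6852  51484 189332 682612    0    0  1048   416  470 1321 12  4 41 43
 2  0   6852  52508 187344 685328    0    0   303   353  485 1369 16  4 68 12
 1  0   6852  52108 187352 685464    0    0     1   144  468 1987 12  4 84  0
EOF

# Average the iowait (wa) column, skipping the two header lines and the
# first sample, which reports averages since boot rather than current load.
awk 'NR > 3 { wa += $16; n++ } END { printf "avg iowait: %.2f%%\n", wa / n }' vmstat.log
# -> avg iowait: 23.25%
```

Averaging over the whole run understates the peak, of course; the two
samples in the middle of the job are the ones near 40%.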

During the run I observed little CPU or memory usage beyond the baseline
from before the job.  None of the bacula daemons, postgres, or bzip got
anywhere near the top of my CPU use list (watching with ksysguard).

A second run went much faster: 14 seconds (1721.6 KB/s) vs. 64 seconds
(376.6 KB/s) the first time.  Both are much better than I got with my
original, bigger jobs.  It was so quick I think vmstat mostly missed it:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0   6852  56496 184148 683932    0    0    43    32  197   19 18  5 75  2
 3  0   6852  56016 178604 690024    0  113     0   429  524 3499 35 10 55  0
 2  0   6852  51988 172476 701556    0    0     1  2023  418 3827 33 11 55  1

It looks as if the 2nd run only hit the cache, not the disk, while
reading the directory (bi is very low)--if I understand the output,
which is a big if.
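One way to test the cache theory is to watch the kernel's Cached figure
around the runs, and to force a cold cache before repeating the timing.
A sketch (the drop_caches knob needs Linux 2.6.16 or later and root):

```shell
# How much file data the kernel is currently caching; if this grows by
# roughly the spool size after the first run, the fast rerun is explained.
grep -E '^(MemFree|Buffers|Cached):' /proc/meminfo

# To repeat the cold-cache timing, flush the page cache first (as root):
#   sync
#   echo 3 > /proc/sys/vm/drop_caches
```

If the rerun is still fast after dropping caches, the cache isn't the
explanation and something else (e.g. the catalog) deserves another look.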

Ross


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
