On Sun, 2007-09-30 at 23:19 +0200, Arno Lehmann wrote:
> Hello,
> 
> 30.09.2007 01:36,, Ross Boylan wrote::
> > On Sat, 2007-09-29 at 13:15 -0700, Ross Boylan wrote:
> >> On Fri, 2007-09-28 at 08:46 +0200, Arno Lehmann wrote:
> >>> Hello,
> >>>
> >>> 27.09.2007 22:47, Ross Boylan wrote:
> >>>> On Thu, 2007-09-27 at 09:19 +0200, Arno Lehmann wrote:
> >>>>> Hi,
> >>>>>
> >>>>> 27.09.2007 01:17, Ross Boylan wrote:
> >>>>>> I've been having really slow backups (13 hours) when I backup a large
> >>>>>> mail spool.  I've attached a run report.  There are about 1.4M files
> >>>>>> with a compressed size of 4G.  I get much better throughput (e.g.,
> >>>>>> 2,000KB/s vs 86KB/s for this job!) with other jobs.
> >>>>> 2MB/s is still not especially fast for a backup to disk, I think. So 
> >>>>> your storage disk might also be a factor here.
> > .....
> >>> vmstat during a backup would be a good next step in this case, I think.
> >>>
> >> Here are the results of a test job.  The first vmstat was shortly after
> >> I started the job
> >> # vmstat 15
> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >>  1  2   7460  50760 204964 667288    0    0    43    32  197   15 18  5 75  2
> >>  1  1   6852  51476 195492 675524   28    0  1790   358  549 1876 20  6 36 38
> >>  0  2   6852  51484 189332 682612    0    0  1048   416  470 1321 12  4 41 43
> >>  2  0   6852  52508 187344 685328    0    0   303   353  485 1369 16  4 68 12
> >>  1  0   6852  52108 187352 685464    0    0     1   144  468 1987 12  4 84  0
> >>
> >> This clearly shows about 40% of the CPU time spent in IO wait during
> >> the backup.  Another 40% is idle.  I'm not sure if the reports are
> >> being thrown off by the fact that I have 2 virtual CPUs (not really:
> >> it's a P4 with hyperthreading).  If that's the case, the 40% might
> >> really mean 80%.
> 
> Interesting question... I never thought about that, and the man page 
> writers for vmstat on my system didn't either. I suppose that vmstat 
> bases its output on the overall available CPU time, i.e. you have 40% 
> of all available CPU time spent in IOwait. Like, one (HT) CPU spends 
> 80% waiting, the other no time at all.
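To check whether the iowait really is piled onto one logical CPU, I
could run mpstat from the sysstat package alongside the job next time
(untested here; assumes sysstat is installed):

# mpstat -P ALL 15

That reports %iowait separately for each logical CPU, so an 80%/0%
split would show up directly.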
> 
> >> During the run I observed little CPU or memory usage beyond where it
> >> was before the job started.  None of the bacula daemons, postgres, or
> >> bzip got anywhere near the top of my CPU use list (using ksysguard).
> >>
> >> A second run went much faster: 14 seconds (1721.6 KB/s) vs 64 seconds
> >> (376.6 KB/s) the first time.  Both are much better than I got with my
> >> original, bigger jobs.  It was so quick I think vmstat missed it:
> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >>  1  0   6852  56496 184148 683932    0    0    43    32  197   19 18  5 75  2
> >>  3  0   6852  56016 178604 690024    0  113     0   429  524 3499 35 10 55  0
> >>  2  0   6852  51988 172476 701556    0    0     1  2023  418 3827 33 11 55  1
> >>
> >> It looks as if the 2nd run only hit the cache, not the disk, while
> >> reading the directory (bi is very low)--if I understand the output,
> >> which is a big if.
> 
> I agree with your assumption.
> 
> > Here are some more stats, systematically varying the software.  I'll
> > just give the total backup time, starting with the 2 reported above:
> > 64
> > 14
> > Upgrade to Postgresql 8.2 from 8.1
> > 41
> > 13
> > upgrade to bacula 2.2.4
> > 13
> > 12
> 
> This helped a lot, I think, though it still might be the buffers 
> performance you're measuring. 
Yes, I think the good first-time performance reflects the fact that the
disk sectors were already in memory.  That's why I tried a different
directory in the next test; its first-time performance was right back
to very slow.
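Another way to get cold-cache numbers without switching directories
would be to drop the page cache between runs (untested on this box;
needs root and a 2.6.16 or later kernel):

# sync
# echo 3 > /proc/sys/vm/drop_caches

That discards the page cache plus dentries and inodes, so the next run
has to go back to the disk.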
> Anyway, given the relative increase you
> observe, I'm rather sure that at least part of it is due to Bacula
> performing better.
> 
Comparing times with the relevant disk blocks apparently in the cache,
it was 13 seconds before the bacula upgrade and 13 and 12 seconds after.
That doesn't seem like much evidence for improvement.  I presume the
part the upgrade should have helped (database transactions) takes the
same time regardless of any problems reading the files to be backed up.

Since the first and 2nd runs share the same need to populate the
catalog, these results seem to show that the speed of the catalog is not
the issue.
> > switch to a new directory for source of backup
> > old one has 1,606 files = 24MB compressed
> > new one has 4,496 files = 27MB
> > 92
> > 22
> > 
With a new directory, the first run is up to 92 seconds (slightly more
MB, but way more files, than the earlier tests).
> > In the slow cases vmstat shows lots of blocks in and major (40%) CPU
> > time in iowait.
> > 
> > I suspect the relatively good first try time with Postgresql 8.2 was a
> > result of having some of the disk still in the cache.
> > 
> > Even the best transfer rates were not particularly impressive (1854
> > kb/s), but the difference between the first and 2nd runs (and the
> > sensitivity to the number of files) seems to indicate that simply finding
> > and opening the files causes a huge slowdown.
> 
> Which is more or less to be expected. You can try to emulate this
> using something like (untested!) 'time find ./ -type f -exec ls -l
> \{\} >/dev/null \;'; this should more or less measure the time needed
> to walk the directory tree and read the file metadata.
# time find upj -ls > /dev/null

real    0m3.383s
user    0m0.060s
sys     0m0.192s

That's the same directory as the earlier tests.  I figured I'd skip the
overhead of forking a process per file (-exec).

Of course, the backup needs to open and read the file as well as the
metadata.
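Something like the following would also pull the file contents through
the filesystem, which is closer to what the backup does (untested; upj
is the same test directory as above):

# time find upj -type f -print0 | xargs -0 cat > /dev/null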
> 
> > I guess it's not surprising that hitting the cache is faster than
> > hitting the disk, but the very slow speed of the disk is striking.
> 
> Yes and yes. How fast can you read the disk raw, i.e. 'dd 
> if=/dev/evms/CyrusSpool bs=4096 of=/dev/null' or something?
> 
> > The mount for that partition is
> > /dev/evms/CyrusSpool /var/spool/cyrus ext3    rw,noatime 0       2
> > The partition is sitting on top of a lot of layers, since it's from an
> > LVM container, with evms on top of that.
> 
> In my experience, that doesn't matter much. On an Athlon 500, with an 
> LVM volume using old PATA drives connected with a really horrible IDE 
> controller, I get a read rate of about 1GB/147s, ~7MB/s:
> 
> time dd if=/dev/mapper/daten-lvhome bs=4096 skip=16k count=256k \
> of=/dev/null
> 262144+0 records in
> 262144+0 records out
> 
> real    2m27.031s
> user    0m0.211s
> sys     0m11.142s
> 
My machine gives:
# time dd if=/dev/evms/CyrusSpool bs=4096 skip=16k count=256k of=/dev/null
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 26.4824 seconds, 40.5 MB/s
so the bottleneck isn't there.
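Of course dd only measures sequential reads, and the spool is lots of
small files, so the real workload is mostly seeks.  A rough comparison
would be to read the whole tree through the filesystem and see how far
the effective rate falls below the 40.5 MB/s raw figure (untested):

# time tar cf - /var/spool/cyrus | wc -c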
> With up-to-date hardware, I'd expect much better throughput. This is 
> from a dualcore Opteron with a single SATA disk, also an LVM volume:
> 
> dd if=/dev/mapper/data-squid bs=4096 skip=16k count=256k of=/dev/null
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 14.9858 s, 71.7 MB/s
> 
> > Probably trying with the catalog on a different physical disk would be a
> > good idea.
> 
> That would be my next step at least.
With the catalog on the same disk as the directory being backed up, I
suppose this could mess up the caching, both by the OS and by the
hardware.  Also, in the tests (though not in the real backup) the cyrus
server was active.  Both it and postgres are probably doing a lot of
syncs, which can't be helping performance.  It's hard to believe the
effect could be that big, but then again it's hard to believe anything
could make the backup so slow.
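If I try that, the least disruptive route might be a PostgreSQL
tablespace on the other disk rather than moving the whole cluster;
something like (untested, and the paths and names are made up):

# mkdir /mnt/otherdisk/pg_catalog
# chown postgres /mnt/otherdisk/pg_catalog
# psql -U postgres -c "CREATE TABLESPACE catalog_ts LOCATION '/mnt/otherdisk/pg_catalog'"

and then recreate or restore the bacula database into that tablespace
before rerunning the tests.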

Ross

