Hello,

30.09.2007 01:36, Ross Boylan wrote:
> On Sat, 2007-09-29 at 13:15 -0700, Ross Boylan wrote:
>> On Fri, 2007-09-28 at 08:46 +0200, Arno Lehmann wrote:
>>> Hello,
>>>
>>> 27.09.2007 22:47, Ross Boylan wrote:
>>>> On Thu, 2007-09-27 at 09:19 +0200, Arno Lehmann wrote:
>>>>> Hi,
>>>>>
>>>>> 27.09.2007 01:17, Ross Boylan wrote:
>>>>>> I've been having really slow backups (13 hours) when I back up a large
>>>>>> mail spool.  I've attached a run report.  There are about 1.4M files
>>>>>> with a compressed size of 4G.  I get much better throughput (e.g.,
>>>>>> 2,000KB/s vs 86KB/s for this job!) with other jobs.
>>>>> 2MB/s is still not especially fast for a backup to disk, I think. So 
>>>>> your storage disk might also be a factor here.
> .....
>>> vmstat during a backup would be a good next step in this case, I think.
>>>
>> Here are the results of a test job.  The first vmstat was shortly after
>> I started the job
>> # vmstat 15
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>  1  2   7460  50760 204964 667288    0    0    43    32  197   15 18  5 75  2
>>  1  1   6852  51476 195492 675524   28    0  1790   358  549 1876 20  6 36 38
>>  0  2   6852  51484 189332 682612    0    0  1048   416  470 1321 12  4 41 43
>>  2  0   6852  52508 187344 685328    0    0   303   353  485 1369 16  4 68 12
>>  1  0   6852  52108 187352 685464    0    0     1   144  468 1987 12  4 84  0
>>
>> This clearly shows about 40% of the CPU time spent in I/O wait
>> during the backup, with another 40% idle.  I'm not sure whether the
>> reports are being thrown off by the fact that I have two virtual
>> CPUs (not really: it's a P4 with hyperthreading).  If that's the
>> case, the 40% might really mean 80%.
Interesting question... I never thought about that, and the man page 
writers for vmstat on my system didn't either. I suppose vmstat bases 
its output on the overall available CPU time, i.e. 40% of all 
available CPU time is spent in iowait: for example, one (HT) CPU 
spending 80% of its time waiting while the other waits not at all.
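To check how the load is actually split between the (virtual) CPUs, 
you can look at the per-CPU counters in /proc/stat directly. A rough, 
untested sketch (field 6 of each cpuN line is the cumulative iowait 
time):

```shell
# Per-CPU lines in /proc/stat look like:
#   cpu0 user nice system idle iowait irq softirq ...
# The counters are cumulative jiffies since boot, so sample twice
# during the backup and compare to get a rate.
grep '^cpu[0-9]' /proc/stat | awk '{printf "%s iowait=%s\n", $1, $6}'
```

If one cpuN line accumulates iowait much faster than the other, the 
40% vmstat reports really is one hyperthread waiting about 80% of 
the time.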

>> During the run I observed little CPU or memory usage above where I was
>> before it.  None of the bacula daemons, postgres or bzip got anywhere
>> near the top of my cpu use list (using ksysguard).
>>
>> A second run went much faster: 14 seconds (1721.6 KB/s) vs 64 seconds
>> (376.6 KB/s) the first time.  Both are much better than I got with my
>> original, bigger jobs.  It was so quick I think vmstat missed it:
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>  1  0   6852  56496 184148 683932    0    0    43    32  197   19 18  5 75  2
>>  3  0   6852  56016 178604 690024    0  113     0   429  524 3499 35 10 55  0
>>  2  0   6852  51988 172476 701556    0    0     1  2023  418 3827 33 11 55  1
>>
>> It looks as if the 2nd run only hit the cache, not the disk, while
>> reading the directory (bi is very low)--if I understand the output,
>> which is a big if.

I agree with your assumption.

> Here are some more stats, systematically varying the software.  I'll
> just give the total backup time, starting with the 2 reported above:
> 64
> 14
> Upgrade to Postgresql 8.2 from 8.1
> 41
> 13
> upgrade to bacula 2.2.4
> 13
> 12

This helped a lot, I think, though it still might be the buffer cache 
performance you're measuring. Anyway, given the relative increase you 
observe, I'm rather sure that at least part of it is due to Bacula 
performing better.

> switch to a new directory for source of backup
> old one has 1,606 files = 24MB compressed
> new one has 4,496 files = 27MB
> 92
> 22
> 
> In the slow cases vmstat shows lots of blocks in and major (40%) CPU
> time in iowait.
> 
> I suspect the relatively good first try time with Postgresql 8.2 was a
> result of having some of the disk still in the cache.
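
To take the cache out of the comparison explicitly, you can flush it 
between runs (needs root; untested on your setup):

```shell
# Write out dirty pages first, then drop the page cache plus
# dentries and inodes, so the next backup run starts cold.
sync
echo 3 > /proc/sys/vm/drop_caches
```

With that in place, first and second runs should show roughly the 
same (slow) times, confirming the difference really is the cache.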
> 
> Even the best transfer rates were not particularly impressive (1854
> KB/s), but the difference between the first and second runs (and the
> sensitivity to the number of files) seems to indicate that simply finding
> and opening the files causes a huge slowdown.

Which is more or less to be expected. You can try to emulate this 
using something like (untested!) 'time find ./ -type f -exec ls -l 
\{\} \; >/dev/null'. This should more or less measure the time needed 
to walk the directory tree and read the file metadata.
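
A variant that avoids forking an ls process for every file, so it 
measures mostly the tree walk and stat() calls themselves (GNU find 
assumed; substitute the spool directory you back up):

```shell
# Let find itself stat each file and print size/mtime/path,
# instead of spawning ls once per file.
time find /var/spool/cyrus -type f -printf '%s %T@ %p\n' > /dev/null
```

On 1.4M files the per-process overhead of the -exec version can 
easily dominate, so this form gives a cleaner lower bound.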

> I guess it's not surprising that hitting the cache is faster than
> hitting the disk, but the very slow speed of the disk is striking.

Yes and yes. How fast can you read the disk raw, i.e. 'dd 
if=/dev/evms/CyrusSpool bs=4096 of=/dev/null' or something?
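
A dd variant using O_DIRECT bypasses the page cache, so repeated runs 
keep measuring the disk itself rather than memory (iflag=direct needs 
GNU dd; untested on your device):

```shell
# Read 1GB straight from the device, skipping the page cache.
# bs=1M keeps the request count low; O_DIRECT may require an
# aligned block size on some setups.
dd if=/dev/evms/CyrusSpool of=/dev/null bs=1M count=1024 iflag=direct
```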

> The mount for that partition is
> /dev/evms/CyrusSpool /var/spool/cyrus ext3    rw,noatime 0       2
> The partition is sitting on top of a lot of layers, since it's from an
> LVM container, with evms on top of that.

In my experience, that doesn't matter much. On an Athlon 500, with an 
LVM volume using old PATA drives connected with a really horrible IDE 
controller, I get a read rate of about 1GB/147s, ~7MB/s:

time dd if=/dev/mapper/daten-lvhome bs=4096 skip=16k count=256k \
of=/dev/null
262144+0 records in
262144+0 records out

real    2m27.031s
user    0m0.211s
sys     0m11.142s

With up-to-date hardware, I'd expect much better throughput. This is 
from a dual-core Opteron with a single SATA disk, also an LVM volume:

dd if=/dev/mapper/data-squid bs=4096 skip=16k count=256k of=/dev/null
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 14.9858 s, 71.7 MB/s

> Probably trying with the catalog on a different physical disk would be a
> good idea.

That would be my next step at least.

Arno

> Ross
> 

-- 
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
