On Fri, 2007-09-28 at 08:46 +0200, Arno Lehmann wrote:
> Hello,
>
> 27.09.2007 22:47,, Ross Boylan wrote::
> > On Thu, 2007-09-27 at 09:19 +0200, Arno Lehmann wrote:
> >> Hi,
> >>
> >> 27.09.2007 01:17,, Ross Boylan wrote::
> >>> I've been having really slow backups (13 hours) when I back up a large
> >>> mail spool. I've attached a run report. There are about 1.4M files
> >>> with a compressed size of 4G. I get much better throughput (e.g.,
> >>> 2,000KB/s vs 86KB/s for this job!) with other jobs.
> >> 2MB/s is still not especially fast for a backup to disk, I think. So
> >> your storage disk might also be a factor here.
> >>
> >>> First, does it sound as if something is wrong? I suspect the number of
> >>> files is the key thing, and the mail spool has lots of little files
> >>> (it's used by Cyrus). Is this just life when you have lots of little
> >>> files?
> >>>
> >>> Second, how can I figure out what the problem is? I do have some
> >>> suspicions, but first some basics:
> >>> ------------------------------------------------
> >>> everything is running on the same box
> >>> 3GHz P4 with one SATA drive as the main drive and 4 older drives, one of
> >>> which is the backup target.
> >>> No noticeable CPU load or disk activity during the backup. I was
> >>> compressing, but that doesn't show up noticeably for CPU use.
> >> How much memory, and how is the memory usage during backups?
> > 2G of RAM. I'll have to watch it to determine how much is in use.
>
> 2 GB sounds ok to me, but you might find that tuning the database
> helps a bit.
>
> ...
> >>> I am not using snapshotting because that feature is broken right now
> >>> (nothing to do with bacula). I shut down the cyrus server during the
> >>> backup (despite some errors in the log around my attempted shutdown, it
> >>> seemed to have worked).
> >>>
> >>> My suspicion is that the TCP/IP transactions are all getting delayed
> >>> (maybe to batch for sending) in a way that usually isn't noticeable, but
> >>> is noticeable when doing lots of quick exchanges locally.
> >> I don't know anything about issues with TCP delays, and I know Bacula
> >> installations running smoothly on all sorts of hardware and different
> >> OSes.
> >>
> >> I rather suspect the catalog to be the bottleneck.
> >>
> >> Verifying this might be as easy as running vmstat while the job is
> >> backed up and seeing if there is lots of iowait happening - this does
> >> not necessarily show as hard disk activity.
> > Would tcp induced delays also show up as iowait?
>
> I'm not sure, because I still don't know what sort of TCP delays this
> would be. Iowait would probably show up if the network driver has to
> wait for the network adapter to process operations.
>
> You could try to use some network benchmark to see if there are
> throughput problems.
>
> >> Are your database and the mail spool on the same disk? This might
> >> explain the slowness you encounter.
> > Yes.
>
> Hmm... this can be a major problem.
>
> >> In this case, I'd suggest to upgrade to Bacula 2.2.4. For two reasons,
> >> actually: There is a serious bug that will hit you one day, and which
> >> is fixed in the current version. Second, the new batch inserts feature
> >> would gain lots of speed if the database throughput really is the
> >> bottleneck for you.
> > I see 2.2.4 is in Debian unstable, so I should be able to pull it in.
> > That would be great if it speeds things up.
>
> Please let us know if this upgrade alone has good results.
>
> ...
> >>> ######## Cyrus
> >>> ## really this needs more care: use snapshot, dump db to ascii
> >> As far as I know, it's sufficient to dump cyrus' database. Given that
> >> dump and a backup of your mail files, a correct cyrus database can be
> >> easily regenerated. Snapshots would be a good thing, perhaps, but
> >> you'd still have to explicitly dump the database as there is no
> >> guarantee that the disk files of the database are always in a
> >> consistent state.
> > cyrus recommends the ascii dump to guard against version changes that
> > would render the binary unusable.
>
> True, but if you restore to the same version of cyrus (actually,
> that's the database version they use) this would not be the main
> problem. Restoring to a different OS/distribution version should
> definitely be done with the ascii dump.
>
> > http://cyrusimap.web.cmu.edu/twiki/bin/view/Cyrus/Backup has more.
>
> You're right: snapshots alone will not assure integrity.
>
> .....
> >> I'm really unsure about TCP problems, but the situation more or less
> >> looks like the catalog backend would be your problem. Could you try to
> >> have the catalog db on another machine?
> > I've only got the one for now.
>
> vmstat during a backup would be a good next step in this case, I think.
>

Here are the results of a test job. The first vmstat run began shortly
after I started the job:

# vmstat 15
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  2   7460  50760 204964 667288    0    0    43    32  197    15 18  5 75  2
 1  1   6852  51476 195492 675524   28    0  1790   358  549  1876 20  6 36 38
 0  2   6852  51484 189332 682612    0    0  1048   416  470  1321 12  4 41 43
 2  0   6852  52508 187344 685328    0    0   303   353  485  1369 16  4 68 12
 1  0   6852  52108 187352 685464    0    0     1   144  468  1987 12  4 84  0

This clearly shows about 40% of the CPU time spent in IO wait during the
backup (wa is 38 and 43 in the second and third samples). Another 40% is
idle. I'm not sure if the reports are being thrown off by the fact that
I have 2 virtual CPUs (not really: it's a P4 with hyperthreading). If
that's the case, the 40% might really mean 80%.

During the run I observed little CPU or memory usage above where I was
before it. None of the bacula daemons, postgres or bzip got anywhere
near the top of my CPU use list (using ksysguard).

A second run went much faster: 14 seconds (1721.6 KB/s) vs 64 seconds
(376.6 KB/s) the first time. Both are much better than I got with my
original, bigger jobs. It was so quick I think vmstat missed it:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0   6852  56496 184148 683932    0    0    43    32  197    19 18  5 75  2
 3  0   6852  56016 178604 690024    0  113     0   429  524  3499 35 10 55  0
 2  0   6852  51988 172476 701556    0    0     1  2023  418  3827 33 11 55  1

It looks as if the 2nd run only hit the cache, not the disk, while
reading the directory (bi is very low)--if I understand the output,
which is a big if.

Ross

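For anyone who wants to check the "about 40%" reading of the first vmstat run, here is a minimal sketch that parses the samples above and averages a column. The helper names parse_vmstat and mean_column are hypothetical, made up for this illustration; they are not part of vmstat or Bacula. The sketch also skips the first row, because vmstat's first report gives averages since boot rather than for the sampling interval (which is presumably why both runs start with nearly identical first rows).

```python
# Illustrative only: parse_vmstat and mean_column are hypothetical helpers.
# The sample data is copied from the first vmstat run in this message.
VMSTAT_SAMPLE = """\
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  2   7460  50760 204964 667288    0    0    43    32  197    15 18  5 75  2
 1  1   6852  51476 195492 675524   28    0  1790   358  549  1876 20  6 36 38
 0  2   6852  51484 189332 682612    0    0  1048   416  470  1321 12  4 41 43
 2  0   6852  52508 187344 685328    0    0   303   353  485  1369 16  4 68 12
 1  0   6852  52108 187352 685464    0    0     1   144  468  1987 12  4 84  0
"""


def parse_vmstat(text):
    """Return one dict per sample row, keyed by the vmstat column names."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, (int(v) for v in row))) for row in rows]


def mean_column(samples, column):
    """Arithmetic mean of one column over the given samples."""
    return sum(s[column] for s in samples) / len(samples)


samples = parse_vmstat(VMSTAT_SAMPLE)

# Row 0 is the since-boot average, so it says nothing about the backup.
# Rows 1 and 2 are the two intervals with heavy block reads (high bi).
heavy_read = samples[1:3]

print("mean wa during heavy reads:", mean_column(heavy_read, "wa"))
print("mean id during heavy reads:", mean_column(heavy_read, "id"))
```

Which rows count as "during the backup" is a judgment call; including the fourth sample (wa 12) pulls the IO-wait average down noticeably, so the 40% figure describes the heaviest part of the run rather than the whole job.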
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users