Hi list,

we've been running a rather large environment for some time now and have had plenty of fun with Bacula. Lately, however, as the load keeps going up, we are seeing some problems again.

The most annoying thing at the moment is stalled (?) jobs. The logs say that the backup is done, yet the job stays in the running state. We've been having some issues with our database. It's painfully slow at the moment and I'm afraid that is one of the causes, but other than really long periods of the director inserting, copying or updating records in the DB we haven't had any major issues so far. Things would just be slow, but jobs wouldn't stall entirely and block the following jobs.

Here's the job log for a job that seems to be hanging:

===========================
2009-04-02 22:30:02 dss-bacula-dir JobId 137275: No prior Full backup Job record found.
2009-04-02 22:30:02 dss-bacula-dir JobId 137275: No prior or suitable Full backup found in catalog. Doing FULL backup.
2009-04-03 03:24:36 dss-bacula-dir JobId 137275: Start Backup JobId 137275, Job=RAL-SERV132_Z_RAL-CON184.2009-04-02_22.30.02.23
2009-04-03 03:24:36 dss-bacula-dir JobId 137275: Using Device "SL500-1-Drive-2"
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/var/run is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/var/run
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/platform is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/platform
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/sbin is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/sbin
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/etc/svc/volatile is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/etc/svc/volatile
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/system/contract is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/system/contract
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/proc is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/proc
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/home is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/home
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/tmp is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/tmp
2009-04-03 04:15:03 RAL-SERV132 JobId 137275: Could not stat /zones/ral-con184/root/mnt/install: ERR=Not owner
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/dev is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/dev
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/net is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/net
2009-04-03 05:46:21 dss-bacula-sd JobId 137275: Job write elapsed time = 02:21:45, Transfer rate = 5.890 M bytes/second
===========================

Apart from the timestamps on the "different filesystem" entries, which we don't really worry about right now, everything looks just peachy. Yet a "stat dir" tells me that the job is still underway:

===========================
137275 Full RAL-SERV132_Z_RAL-CON184.2009-04-02_22.30.02.23 is running
===========================

In the past, in a situation like this, I would have seen an INSERT or COPY job chugging away when running `top`. This time I don't. Somewhere along the line the DB jobs must have come to a stop or something, because `ps` does show a bunch of COPYs:

===========================
dss-bacula:~# ps aux | grep post | grep bacula
postgres 15910 4.0 3.5 168304 141168 ? S Apr02 25:15 postgres: bacula bacula 127.0.0.1(51704) idle
postgres 19899 0.0 0.9 199924 39504 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39674) COPY
postgres 19917 0.0 0.9 199924 39468 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39682) COPY
postgres 20675 0.0 0.7 199792 31104 ? S 04:08 0:01 postgres: bacula bacula 127.0.0.1(47605) COPY
postgres 21115 0.0 0.9 199924 40092 ? S 04:21 0:01 postgres: bacula bacula 127.0.0.1(34362) idle
postgres 21132 0.0 0.4 183272 18440 ? S 04:22 0:00 postgres: bacula bacula 127.0.0.1(34369) COPY
postgres 22702 0.0 0.3 175076 13760 ? S 06:23 0:00 postgres: bacula bacula 127.0.0.1(46992) COPY
postgres 22977 0.0 0.6 199792 25012 ? S 06:28 0:01 postgres: bacula bacula 127.0.0.1(59001) COPY
postgres 23855 0.1 0.9 199920 39888 ? S 07:47 0:02 postgres: bacula bacula 127.0.0.1(42439) COPY
===========================

Looks somewhat healthy to me, except that those backends should be somewhere among the top ten in `top` and should really be burning up CPU time. The examples above are in fact only excerpts: I am facing a total of six of these jobs at this very moment, and I am somewhat afraid that they might not make it all the way into the database and will eventually turn out to be unusable. What I could do, of course, is run bscan afterwards and have it fix the DB issues, but that just can't be the right way.

Anyway, I need some advice on where to start looking and debugging. We have talked to some Postgres experts and will, as soon as resources are available, work on our database issues in order to get a boost from that side. But even if we end up with an improvement of around 200% (roughly three times faster), which is in fact very probable, a job that now takes 6 hours to insert its files into the database will still need about 2 hours to complete. I figure there might be more to it than just that.

Any suggestion is appreciated.

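In case it helps anyone reproduce the "COPY backends that burn no CPU" check, here is a rough Python sketch that flags such backends in saved `ps aux` output. It is only an illustration under a few assumptions: the helper names (cpu_seconds, parse_backends, quiet_copy_backends) and the 0.5 % CPU threshold are made up for this example and are not part of Bacula or PostgreSQL, and the parsing assumes the Linux `ps aux` column layout shown in the listing above.

```python
# Rows copied from the `ps aux` listing above.
PS_OUTPUT = """\
postgres 15910 4.0 3.5 168304 141168 ? S Apr02 25:15 postgres: bacula bacula 127.0.0.1(51704) idle
postgres 19899 0.0 0.9 199924 39504 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39674) COPY
postgres 19917 0.0 0.9 199924 39468 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39682) COPY
postgres 20675 0.0 0.7 199792 31104 ? S 04:08 0:01 postgres: bacula bacula 127.0.0.1(47605) COPY
postgres 21115 0.0 0.9 199924 40092 ? S 04:21 0:01 postgres: bacula bacula 127.0.0.1(34362) idle
postgres 21132 0.0 0.4 183272 18440 ? S 04:22 0:00 postgres: bacula bacula 127.0.0.1(34369) COPY
postgres 22702 0.0 0.3 175076 13760 ? S 06:23 0:00 postgres: bacula bacula 127.0.0.1(46992) COPY
postgres 22977 0.0 0.6 199792 25012 ? S 06:28 0:01 postgres: bacula bacula 127.0.0.1(59001) COPY
postgres 23855 0.1 0.9 199920 39888 ? S 07:47 0:02 postgres: bacula bacula 127.0.0.1(42439) COPY
"""


def cpu_seconds(time_field):
    """Convert a colon-separated ps TIME field such as '25:15' to seconds."""
    seconds = 0
    for part in time_field.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_backends(ps_text):
    """Return one dict per PostgreSQL backend row found in `ps aux` output."""
    backends = []
    for line in ps_text.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[10].startswith("postgres:"):
            continue  # skip prompts, headers and unrelated processes
        backends.append({
            "pid": int(fields[1]),
            "cpu_percent": float(fields[2]),
            "started": fields[8],
            "cpu_seconds": cpu_seconds(fields[9]),
            # Last word of the process title, e.g. "COPY" or "idle".
            "state": fields[10].split()[-1],
        })
    return backends


def quiet_copy_backends(ps_text, max_cpu_percent=0.5):
    """Backends that claim to be in COPY but use almost no CPU.

    The 0.5 % threshold is an arbitrary assumption for illustration.
    """
    return [
        b for b in parse_backends(ps_text)
        if b["state"] == "COPY" and b["cpu_percent"] <= max_cpu_percent
    ]


if __name__ == "__main__":
    for b in quiet_copy_backends(PS_OUTPUT):
        print(b["pid"], b["started"], b["cpu_seconds"], "s CPU so far")
```

Running it twice a few minutes apart and comparing the accumulated CPU seconds per PID would show whether a COPY is making any progress at all; low CPU alone does not say whether a backend is waiting on I/O, on a lock, or on the client.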
Regards
Ronald

--
Kind regards
Ronald Buder

Tel.: +49(351)440080
Fax: +49(351)4400818
Mobile: +49(179)3218366
Email: rbu...@proficom-ag.de
web: www.proficom-ag.de

profi.com AG business solutions
Registered office: Stresemannplatz 3, 01309 Dresden
Berlin office: Potsdamer Platz 11, 10785 Berlin
Amtsgericht Dresden, HRB 23438
Executive board: Heiko Worm, Chairman of the supervisory board: Friedrich Geise

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users