On Friday, 03 April 2009 08:29:42, Ronald Buder wrote:
> Hi list,

Sorry, I forgot to add some of the most important information:

We're running a 2.4.4 environment. The server is a Debian Etch box, the database a PostgreSQL 8.1. We've been wanting to run a dist-upgrade to Lenny but haven't really found the time and guts to do that yet.

Clients are all over the place: anywhere from 2.2.7 to 2.4.4, on quite a few different operating systems (Windows, Linux, Solaris, AIX, HP-UX), each of which in several different releases. The most recently hanging jobs are in fact 2.2.8 clients on SPARC Solaris 10, but there's no general rule as to which client/server/OS combination causes trouble. It really looks like a weird load issue.

Thanks in advance for suggestions...

Regards,
Ronald

>
> we've been running a rather large environment for some time now and have
> had plenty of fun with Bacula. However, lately, as the load keeps going
> up, we see some problems again.
>
> The most annoying things at the moment are stalled (?) jobs. The logs say
> that the backup is done. We've been having some issues as far as our
> database goes. It's painfully slow at the moment and I'm afraid that is
> one of the causes, but other than really long periods of the director
> inserting, copying or updating records in the DB we haven't had any
> major issues. Things would just be slow, but they wouldn't entirely
> stall and block the following jobs.
>
> Here's the log of a job that seems to be hanging:
>
> ===========================
> 2009-04-02 22:30:02 dss-bacula-dir JobId 137275: No prior Full backup Job record found.
> 2009-04-02 22:30:02 dss-bacula-dir JobId 137275: No prior or suitable Full backup found in catalog. Doing FULL backup.
> 2009-04-03 03:24:36 dss-bacula-dir JobId 137275: Start Backup JobId 137275, Job=RAL-SERV132_Z_RAL-CON184.2009-04-02_22.30.02.23
> 2009-04-03 03:24:36 dss-bacula-dir JobId 137275: Using Device "SL500-1-Drive-2"
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/var/run is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/var/run
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/platform is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/platform
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/sbin is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/sbin
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/etc/svc/volatile is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/etc/svc/volatile
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/system/contract is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/system/contract
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/proc is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/proc
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/home is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/home
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/tmp is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/tmp
> 2009-04-03 04:15:03 RAL-SERV132 JobId 137275: Could not stat /zones/ral-con184/root/mnt/install: ERR=Not owner
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/dev is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/dev
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/net is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/net
> 2009-04-03 05:46:21 dss-bacula-sd JobId 137275: Job write elapsed time = 02:21:45, Transfer rate = 5.890 M bytes/second
> ===========================
>
> Apart from the timestamps on the "different filesystem" entries, which
> we don't really worry about right now, everything looks just peachy.
>
> Now a "stat dir" tells me that the job is still underway:
>
> ===========================
> 137275 Full RAL-SERV132_Z_RAL-CON184.2009-04-02_22.30.02.23 is running
> ===========================
>
> In the past, in a situation like this, I would have seen an INSERT or
> COPY chugging away when running `top`. However, I don't. Somewhere along
> the line the DB jobs must have come to a stop or something, because a
> `ps` does eventually show a bunch of COPYs:
>
> ===========================
> dss-bacula:~# ps aux | grep post | grep bacula
> postgres 15910 4.0 3.5 168304 141168 ? S Apr02 25:15 postgres: bacula bacula 127.0.0.1(51704) idle
> postgres 19899 0.0 0.9 199924 39504 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39674) COPY
> postgres 19917 0.0 0.9 199924 39468 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39682) COPY
> postgres 20675 0.0 0.7 199792 31104 ? S 04:08 0:01 postgres: bacula bacula 127.0.0.1(47605) COPY
> postgres 21115 0.0 0.9 199924 40092 ? S 04:21 0:01 postgres: bacula bacula 127.0.0.1(34362) idle
> postgres 21132 0.0 0.4 183272 18440 ? S 04:22 0:00 postgres: bacula bacula 127.0.0.1(34369) COPY
> postgres 22702 0.0 0.3 175076 13760 ? S 06:23 0:00 postgres: bacula bacula 127.0.0.1(46992) COPY
> postgres 22977 0.0 0.6 199792 25012 ? S 06:28 0:01 postgres: bacula bacula 127.0.0.1(59001) COPY
> postgres 23855 0.1 0.9 199920 39888 ? S 07:47 0:02 postgres: bacula bacula 127.0.0.1(42439) COPY
> ===========================
>
> That looks somewhat healthy to me, except that those backends should be
> somewhere among the top ten in `top` and should really be burning up CPU
> time. As you can see, the above is in fact only an excerpt: I am facing
> a total of six of these jobs at this very moment, and I am somewhat
> afraid that they might not make it all the way into the database and
> will eventually turn out to be unusable. What I could do, of course,
> would be to run bscan afterwards and have it fix the catalog (a rough
> invocation is sketched below), but that just can't be the right way.
>
> Anyway, I need some advice on where to start looking and debugging. We
> have talked to some Postgres experts and will, as soon as resources are
> available, work on our database issues in order to get a boost from that
> side. But even if we end up with something like a 200% improvement,
> which is in fact very probable, a job that now takes 6 hours to insert
> its files into the database will then still need about 2 hours to
> complete. I figure there might be more to it than just that.
>
> Any suggestion is appreciated.
>
> Regards
>
> Ronald
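To tell whether those COPY backends are actually blocked rather than just slow, it can help to ask PostgreSQL itself. A minimal sketch, assuming the catalog database and user are both called "bacula" (as the connection strings in the ps output suggest); the PID is one of the COPY backends from the listing above, and on 8.1 the current_query column is only populated when stats_command_string is turned on:

===========================
# Any ungranted lock here means a COPY is blocked rather than slow:
psql -U bacula -d bacula -c "SELECT pid, locktype, mode, granted FROM pg_locks WHERE NOT granted;"

# What each backend is doing right now (8.1 still calls the column procpid):
psql -U bacula -d bacula -c "SELECT procpid, current_query FROM pg_stat_activity;"

# If nothing is waiting, attach to one of the COPY backends and watch its
# system calls to see whether it is doing I/O or just sitting there:
strace -p 19899

# Also worth a look: are the indexes Bacula expects on the file table still
# there? A missing index can turn the end-of-job inserts into hours of work.
psql -U bacula -d bacula -c "\d file"
===========================

If no lock shows up as ungranted and strace shows the backend sitting in a read() on its client socket, the backend is simply waiting for the Director to send more COPY data, which would point at the Bacula side rather than at the database itself.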
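On the bscan fallback: rebuilding the catalog records from the already-written volumes would look roughly like the sketch below. The device name is the one from the job log (the archive device path, e.g. /dev/nst0, should be accepted as well), while the volume name and the config path are placeholders; the exact flags are worth double-checking against the bscan man page shipped with 2.4.x.

===========================
# Scan the named volume on the given storage device and write the recovered
# Job and File records back into the catalog (-s), updating the Media
# records as well (-m); -v gives verbose progress output.
bscan -v -s -m -c /etc/bacula/bacula-sd.conf -V Vol0001 "SL500-1-Drive-2"
===========================

It re-reads the whole volume, though, so on full tapes it can take roughly as long as the backup itself did, which is why it only makes sense as a last resort.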
--
Best regards

Ronald Buder

Tel.: +49(351)440080
Fax: +49(351)4400818
Mobile: +49(179)3218366
Email: rbu...@proficom-ag.de
Web: www.proficom-ag.de

profi.com AG business solutions
Registered office: Stresemannplatz 3, 01309 Dresden
Berlin office: Potsdamer Platz 11, 10785 Berlin

Amtsgericht Dresden, HRB 23438
Board: Heiko Worm; Chairman of the Supervisory Board: Friedrich Geise