Hello, after some weeks of proper operation, my new bacula site causes problems.
Problem description: After setting up the site, job scheduling worked for about 2 weeks. Now, quite often, the director hangs after starting the scheduled jobs at night. Connecting via console is not possible anymore. The only solution so far was restarting the director and manually running the jobs. After a few days (maybe even the next night) the problem occurs again. Looking into the archives of this list, I found some similar problem descriptions but none of the suggest things lead to a solution (at least none reported). The amazing thing is, that I have a second site with very similar config but only another loader, that works fine for years now. Now, I found out, that it could have come from a hanging sd, too. See below why. Question is: why hangs the dir because the sd is hanging? I haven't been able to produce a trace-file. I'll set debuglevel=100 and trace=1 now and see if I can gather some information and post it to this list as soon as possbile. What I can provide (so far) is some other debugging output (see below). Does anyone have ideas or comments on this odd behaviour?? Regards Volker Debugging-Session attached: 8<------------------------------------------------------------------------------- System: Debian Sarge Kernel 2.6.8-2-k7-smp Bacula: 1.36.2-2sarge1 Storage: Overland 10x DLT with Quantum DLT40/80 Database: mysql 4.0.24-10 (Size: 1.4G) dir, sd and mysql on the same machine. Approx. 15 Clients. DiskSpooling for all machines except bacula-host itself. ----------------------------------------------------- dakar: / 7# bconsole Connecting to Director dakar:9101 Director authorization problem. Most likely the passwords do not agree. Please see http://www.bacula.org/html-manual/faq.html#AuthorizationErrors for help. ----------------------------------------------------- (passwords are definitely okay). This is the conmesg-file so far: ----------------------- dakar: /var/lib/bacula 33# cat backup-dir.conmsg 11-Jul 21:00 backup-dir: Start Backup JobId 899,Job=paris-home.archived.2005-07-11_21.00.00 11-Jul 21:00 backup-dir: Start Backup JobId 900,Job=paris-netboot.2005-07-11_21.00.01 11-Jul 21:00 backup-sd: 3301 Issuing autochanger "loaded drive 0" command. 11-Jul 21:00 backup-sd: 3302 Autochanger "loaded drive 0", result is Slot 3. 11-Jul 21:00 backup-sd: Volume "DiffInc-03" previously written, moving to end of data. 11-Jul 21:00 backup-dir: Start Backup JobId 907, Job=bali-rootfs.2005-07-11_21.00.08 ------------------------ The bali-fd is up and seems to okay. The director seems to wait for something. Nothing else in this file There's nothing in the syslog (SCSI error etc.), either: ---------------------------------------------------- Jul 11 20:40:30 dakar -- MARK -- Jul 11 21:00:30 dakar -- MARK -- Jul 11 21:17:01 dakar /USR/SBIN/CRON[12590]: (root) CMD ( run-parts --report /etc/cron.hourly) Jul 11 21:40:30 dakar -- MARK -- Jul 11 22:00:30 dakar -- MARK -- Jul 11 22:17:01 dakar /USR/SBIN/CRON[12737]: (root) CMD ( run-parts --report /etc/cron.hourly) Jul 11 22:40:30 dakar -- MARK -- Jul 11 23:00:30 dakar -- MARK -- ----------------------------------------------------- Here's what strace -p says: on bacula-director-process: dakar: ~ 2# strace -v -p 19051 Process 19051 attached - interrupt to quit futex(0x80c5680, FUTEX_WAIT, 2, NULL (running bconsole doesn't produce any additional output). It seems to wait for something. on bacula-sd-process: dakar: ~ 3# strace -v -p 19099 Process 19099 attached - interrupt to quit select(5, [4], NULL, NULL, NULL <unfinished ...> Process 19099 detached on bacula-fd on bali-fd: bali: ~ 4# strace -v -p 3301 Process 3301 attached - interrupt to quit select(4, [3], NULL, NULL, NULL <unfinished ...> Process 3301 detached on bacula-fd on paris-fd: paris: ~# strace -v -p 14280 Process 14280 attached - interrupt to quit select(4, [3], NULL, NULL, NULL Something is quite strange about the bacula-fd on paris: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 14280 root 16 0 263m 221m 2840 S 0.0 10.9 0:00.02 /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf The bacula-fd is 263M in size!! Hell, why so much?? I don't know if this has something to do with the director. What I'll do right now is restart paris-fd and see what happens: Results: - bacula-fd on paris is normal size - bconsole still doesn't connect - backup-dir.conmsg contains messages about non-reachable paris: 11-Jul 23:33 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal error: Network error with FD during Backup: ERR=No data available 11-Jul 23:33 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error: Network error with FD during Backup: ERR=No data available 11-Jul 23:34 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal error: No Job status returned from FD. 11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error: No Job status returned from FD. 11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Error: Bacula 1.36.2 (28Feb05): 11-Jul-2005 23:34:14 JobId: 900 Job: paris-netboot.2005-07-11_21.00.01 Backup Level: Incremental, since=2005-07-08 21:00:02 Client: paris-fd FileSet: "paris-netboot" 2005-06-11 01:35:56 Pool: "DiffInc" Storage: "DLT" Start time: 11-Jul-2005 21:00:02 [.........] bacula-dir still FUTEX_WAITing: dakar: /var/lib/bacula 9# strace -p 19051 Process 19051 attached - interrupt to quit futex(0x80c5680, FUTEX_WAIT, 2, NULL <unfinished ...> Well, so far. I will restart bacula-dir now und run some jobs manually: ..... After running the jobs and waiting for 10 minutes nothing happens. the bacula-dir still works and stat sd says: Device status: Device "/dev/tape" is mounted with Volume "DiffInc-03" Device is BLOCKED waiting for appendable media. Total Bytes Read=0 Blocks Read=0 Bytes/block=0 Positioned at File=19 Block=0 Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898 max bytes/job. Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max bytes. I'll try unmount and mount: *umount Automatically selected Storage: DLT 3001 Device "/dev/tape" unmounted. * *mount Automatically selected Storage: DLT 3001 Device /dev/tape is mounted with Volume "DiffInc-03" * But Again: Device status: Device "/dev/tape" is mounted with Volume "DiffInc-03" Device is BLOCKED waiting for appendable media. Total Bytes Read=64,512 Blocks Read=1 Bytes/block=64,512 Positioned at File=0 Block=0 Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898 max bytes/job. Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max bytes. This time the bacula-sd seems to be stuck. Maybe this causes the bacula-fd after hours of waiting to hang??? I'm restarting sd: 12-Jul 00:01 backup-sd: hanau-web.2005-07-11_23.44.52 Fatal error: Job 922 canceled. 12-Jul 00:01 backup-sd: donar-home.2005-07-11_23.44.39 Fatal error: Job 921 canceled. 12-Jul 00:01 backup-sd: paris-home.guest.2005-07-11_23.42.33 Fatal error: Job 915 canceled. 12-Jul 00:01 backup-sd: paris-home.archived.2005-07-11_23.42.29 Fatal error: Job 914 canceled. 12-Jul 00:01 backup-sd: caracas.2005-07-11_23.42.18 Fatal error: Job 913 canceled. 12-Jul 00:01 donar-fd: donar-home.2005-07-11_23.44.39 Fatal error: job.c:1665 Bad response to Append Data command. Wanted 3000 OK data , got 3903 Error append data 12-Jul 00:01 paris-fd: paris-home.guest.2005-07-11_23.42.33 Fatal error: job.c:1665 Bad response to Append Data command. Wanted 3000 OK data , got 3903 Error append data 12-Jul 00:01 paris-fd: paris-home.archived.2005-07-11_23.42.29 Fatal error: job.c:1665 Bad response to Append Data command. Wanted 3000 OK data [ ... errors of broken jobs removed ...] 12-Jul 00:01 backup-dir: Start Backup JobId 918, Job=paris-home.staff.3.2005-07-11_23.42.44 12-Jul 00:01 backup-dir: Start Backup JobId 919, Job=paris-home.prak.2005-07-11_23.42.49 12-Jul 00:01 backup-sd: 3301 Issuing autochanger "loaded drive 0" command. 12-Jul 00:01 backup-sd: 3302 Autochanger "loaded drive 0", result is Slot 3. 12-Jul 00:01 backup-sd: Volume "DiffInc-03" previously written, moving to end of data. * * 12-Jul 00:02 backup-sd: Ready to append to end of Volume "DiffInc-03" at file=19. 12-Jul 00:02 backup-sd: Spooling data ... 12-Jul 00:02 backup-sd: Spooling data ... now it seems to work.... ok, all jobs finished. Question is: is there a connection between a hanging sd (which I'm not sure if it was hanging while the dir was hanging!?) and the hangig dir?? I'll try to get a trace file of the situation! End Debugging Output ---------------------------------------------------------------------- -- Volker Sauer * Alexanderstrasse 39/217 * 64283 Darmstadt Telefon: 06151-154260 * Mobil: 0179-6901475 * ICQ#98164307 mailto:[EMAIL PROTECTED] * http://www.volker-sauer.de PGPKey-Fingerprint: DB2611C7B12E0B2739992E4F7E354E4D5DD5D0E0
signature.asc
Description: Digital signature