Hello,
after some weeks of proper operation, my new bacula site causes
problems.
Problem description:
After setting up the site, job scheduling worked for about 2 weeks. Now,
quite often, the director hangs after starting the scheduled jobs at
night. Connecting via console is not possible anymore.
The only solution so far was restarting the director and manually
running the jobs. After a few days (maybe even the next night) the
problem occurs again.
Looking into the archives of this list, I found some similar problem
descriptions but none of the suggest things lead to a solution (at least
none reported).
The amazing thing is, that I have a second site with very similar
config but only another loader, that works fine for years now.
Now, I found out, that it could have come from a hanging sd, too. See
below why. Question is: why hangs the dir because the sd is hanging?
I haven't been able to produce a trace-file. I'll set debuglevel=100 and
trace=1 now and see if I can gather some information and post it to this
list as soon as possbile.
What I can provide (so far) is some other debugging output (see below).
Does anyone have ideas or comments on this odd behaviour??
Regards
Volker
Debugging-Session attached:
8<-------------------------------------------------------------------------
------
System: Debian Sarge Kernel 2.6.8-2-k7-smp
Bacula: 1.36.2-2sarge1
Storage: Overland 10x DLT with Quantum DLT40/80
Database: mysql 4.0.24-10 (Size: 1.4G)
dir, sd and mysql on the same machine. Approx. 15 Clients. DiskSpooling
for all machines except bacula-host itself.
-----------------------------------------------------
dakar: / 7# bconsole
Connecting to Director dakar:9101
Director authorization problem.
Most likely the passwords do not agree.
Please see
http://www.bacula.org/html-manual/faq.html#AuthorizationErrors for help.
-----------------------------------------------------
(passwords are definitely okay).
This is the conmesg-file so far:
-----------------------
dakar: /var/lib/bacula 33# cat backup-dir.conmsg
11-Jul 21:00 backup-dir: Start Backup JobId
899,Job=paris-home.archived.2005-07-11_21.00.00 11-Jul 21:00 backup-dir:
Start Backup JobId 900,Job=paris-netboot.2005-07-11_21.00.01 11-Jul 21:00
backup-sd: 3301 Issuing autochanger "loaded drive 0" command. 11-Jul 21:00
backup-sd: 3302 Autochanger "loaded drive 0", result is Slot 3. 11-Jul
21:00 backup-sd: Volume "DiffInc-03" previously written, moving to end of
data. 11-Jul 21:00 backup-dir: Start Backup JobId 907,
Job=bali-rootfs.2005-07-11_21.00.08 ------------------------
The bali-fd is up and seems to okay. The director seems to wait for
something. Nothing else in this file
There's nothing in the syslog (SCSI error etc.), either:
----------------------------------------------------
Jul 11 20:40:30 dakar -- MARK --
Jul 11 21:00:30 dakar -- MARK --
Jul 11 21:17:01 dakar /USR/SBIN/CRON[12590]: (root) CMD ( run-parts
--report /etc/cron.hourly)
Jul 11 21:40:30 dakar -- MARK --
Jul 11 22:00:30 dakar -- MARK --
Jul 11 22:17:01 dakar /USR/SBIN/CRON[12737]: (root) CMD ( run-parts
--report /etc/cron.hourly)
Jul 11 22:40:30 dakar -- MARK --
Jul 11 23:00:30 dakar -- MARK --
-----------------------------------------------------
Here's what strace -p says:
on bacula-director-process:
dakar: ~ 2# strace -v -p 19051
Process 19051 attached - interrupt to quit
futex(0x80c5680, FUTEX_WAIT, 2, NULL
(running bconsole doesn't produce any additional output). It seems to
wait for something.
on bacula-sd-process:
dakar: ~ 3# strace -v -p 19099
Process 19099 attached - interrupt to quit
select(5, [4], NULL, NULL, NULL <unfinished ...>
Process 19099 detached
on bacula-fd on bali-fd:
bali: ~ 4# strace -v -p 3301
Process 3301 attached - interrupt to quit
select(4, [3], NULL, NULL, NULL <unfinished ...>
Process 3301 detached
on bacula-fd on paris-fd:
paris: ~# strace -v -p 14280
Process 14280 attached - interrupt to quit
select(4, [3], NULL, NULL, NULL
Something is quite strange about the bacula-fd on paris:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14280 root 16 0 263m 221m 2840 S 0.0 10.9 0:00.02
/usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf
The bacula-fd is 263M in size!! Hell, why so much?? I don't know if
this has something to do with the director.
What I'll do right now is restart paris-fd and see what happens:
Results:
- bacula-fd on paris is normal size
- bconsole still doesn't connect
- backup-dir.conmsg contains messages about non-reachable paris:
11-Jul 23:33 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal
error: Network error with FD during Backup: ERR=No data available
11-Jul 23:33 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error:
Network error with FD during Backup: ERR=No data available
11-Jul 23:34 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal
error: No Job status returned from FD.
11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error:
No Job status returned from FD.
11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Error: Bacula
1.36.2 (28Feb05): 11-Jul-2005 23:34:14
JobId: 900
Job: paris-netboot.2005-07-11_21.00.01
Backup Level: Incremental, since=2005-07-08 21:00:02
Client: paris-fd
FileSet: "paris-netboot" 2005-06-11 01:35:56
Pool: "DiffInc"
Storage: "DLT"
Start time: 11-Jul-2005 21:00:02
[.........]
bacula-dir still FUTEX_WAITing:
dakar: /var/lib/bacula 9# strace -p 19051
Process 19051 attached - interrupt to quit
futex(0x80c5680, FUTEX_WAIT, 2, NULL <unfinished ...>
Well, so far. I will restart bacula-dir now und run some jobs manually:
.....
After running the jobs and waiting for 10 minutes nothing happens. the
bacula-dir still works and stat sd says:
Device status:
Device "/dev/tape" is mounted with Volume "DiffInc-03"
Device is BLOCKED waiting for appendable media.
Total Bytes Read=0 Blocks Read=0 Bytes/block=0
Positioned at File=19 Block=0
Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898
max bytes/job.
Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max
bytes.
I'll try unmount and mount:
*umount
Automatically selected Storage: DLT
3001 Device "/dev/tape" unmounted.
*
*mount
Automatically selected Storage: DLT
3001 Device /dev/tape is mounted with Volume "DiffInc-03"
*
But Again:
Device status:
Device "/dev/tape" is mounted with Volume "DiffInc-03"
Device is BLOCKED waiting for appendable media.
Total Bytes Read=64,512 Blocks Read=1 Bytes/block=64,512
Positioned at File=0 Block=0
Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898
max bytes/job.
Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max
bytes.
This time the bacula-sd seems to be stuck. Maybe this causes the
bacula-fd after hours of waiting to hang???
I'm restarting sd:
12-Jul 00:01 backup-sd: hanau-web.2005-07-11_23.44.52 Fatal error: Job
922 canceled.
12-Jul 00:01 backup-sd: donar-home.2005-07-11_23.44.39 Fatal error: Job
921 canceled.
12-Jul 00:01 backup-sd: paris-home.guest.2005-07-11_23.42.33 Fatal
error: Job 915 canceled.
12-Jul 00:01 backup-sd: paris-home.archived.2005-07-11_23.42.29 Fatal
error: Job 914 canceled.
12-Jul 00:01 backup-sd: caracas.2005-07-11_23.42.18 Fatal error: Job 913
canceled.
12-Jul 00:01 donar-fd: donar-home.2005-07-11_23.44.39 Fatal error:
job.c:1665 Bad response to Append Data command. Wanted 3000 OK data
, got 3903 Error append data
12-Jul 00:01 paris-fd: paris-home.guest.2005-07-11_23.42.33 Fatal error:
job.c:1665 Bad response to Append Data command. Wanted 3000 OK data
, got 3903 Error append data
12-Jul 00:01 paris-fd: paris-home.archived.2005-07-11_23.42.29 Fatal
error: job.c:1665 Bad response to Append Data command. Wanted 3000 OK
data
[ ... errors of broken jobs removed ...]
12-Jul 00:01 backup-dir: Start Backup JobId 918,
Job=paris-home.staff.3.2005-07-11_23.42.44
12-Jul 00:01 backup-dir: Start Backup JobId 919,
Job=paris-home.prak.2005-07-11_23.42.49
12-Jul 00:01 backup-sd: 3301 Issuing autochanger "loaded drive 0"
command.
12-Jul 00:01 backup-sd: 3302 Autochanger "loaded drive 0", result is
Slot 3.
12-Jul 00:01 backup-sd: Volume "DiffInc-03" previously written, moving
to end of data.
*
*
12-Jul 00:02 backup-sd: Ready to append to end of Volume "DiffInc-03" at
file=19.
12-Jul 00:02 backup-sd: Spooling data ...
12-Jul 00:02 backup-sd: Spooling data ...
now it seems to work.... ok, all jobs finished.
Question is: is there a connection between a hanging sd (which I'm not
sure if it was hanging while the dir was hanging!?) and the hangig dir??
I'll try to get a trace file of the situation!
End Debugging Output
----------------------------------------------------------------------