Hello Arno, I've been spending a lot of time thinking about this problem lately, trying to take in all the information you have supplied. For the moment, the most promising idea is what you found with unloaded drives. By the way, I am convinced that the problems Rick Knight are seeing are totally unrelated.
On Wednesday 14 December 2005 09:28, Arno Lehmann wrote: > Hello, > > some more results. I was curious, so I found the opportunity to verify > at least part of todays regular backup run. > > Arno Lehmann schrieb: > > Kern - > > > > Kern Sibbald schrieb: > >> Hmmm. I think I am going to pack it in and take a year's vacation ... > > > > don't. Yea, I won't, but I am sure getting annoyed with this problem. Please don't take that to mean I am annoyed with you, quite the contrary, you have been very patient. > > > > I was a little stupid when I tried to analyze the problem. Or rather, I > > didn't really analyze. > > > > What I did this evening: Shutdown bacula. Use mtx to load a tape into > > ONE of my TWO autoloaders (might be important later). > > Start Bacula again. > > > > Start a job going to autoloader ONE. > > > > VoilĂ . > > > > New results now: When Bacula encounters an empty autoloader drive at > > startup, I can't get it loading and using tapes. When the drive is > > loaded when Bacula (SD, probably) starts, everything's ok. Well, this is not at all good, but it is probably *very* important. If I remember right, you are using SUSE, can you remind me of what kernel you are on. I ask because with 2.6 kernels (actually 2.5.x) there was this major change that was discussed some time ago where if there is no tape in the drive, an open() in read/write mode waits for some period of time, then fails. Now, perhaps (very unlikely) on your system, it waits forever. Before anything else can work, Bacula needs to be able to open a tape drive correctly. On my system Fedora FC4 (kernel 2.6.14), it works fine. Perhaps at this point, I need to write a little test program that you can run for me that will report back what is going on. > > > > What I will / might / should test: > > - What happens with the empty autoloader TWO? I suspect it will remain > > blocked. > > - What happens if I umount ONE and later mount it again? > > I had to change tapes. The normal umount - change magazines - update > slots scan - mount worked. Jobs that go to the loaded volumes run, > others wait. No jobs failed until now, but up till now my original > problem situation didn't happen. > > > Tomorrow, I've got a job scheduled that should go to TWO - I'll report. > > This job is still waiting for Maximum Concurrent Jobs on the client... OK, Regards, Kern > > Arno > > > Arno > > > >> On Monday 12 December 2005 20:21, Arno Lehmann wrote: > >>> Hi, > >>> > >>> Arno Lehmann schrieb: > >>>> Hi, > >>>> > >>>> Kern Sibbald schrieb: > >>> > >>> ... > >>> > >>>>> Anyway, I would really appreciate it if you could try the 1.38.3 > >>>>> beta. Please use the released tar file rather than the CVS -- the > >>>>> HEAD CVS is now slightly behind the 1.38 branch in terms of fixes ... > >>>> > >>>> I'll download it in a few minutes... > >>>> > >>>>> Although I haven't been able to reproduce the pool problems you are > >>>>> having, I have been working on it nevertheless. When I was able to > >>>>> duplicate the two drive problem of not running jobs simultaneously on > >>>>> both drives, I took the time to review the reservation system in > >>>>> detail, and I found a few places where there *may* have been > >>>>> problems. > >>>> > >>>> I should see if the version 1.38.3beta fixes my problem on Tuesday - > >>>> that's still my jobs-to-different-pools-day. I'll report what happens. > >>>> And, of course, I've got one run of straight and usual backups before > >>>> that, although I don't think there will be any problems tomorrow. > >>> > >>> Unfortunately, I have to report something rather unexpected. The device > >>> reservation system works a little too good ;-) > >>> > >>> I started my bachup server this morning (or rather, it started itself). > >>> All tape drives Bacula itself loads were empty, magazines were filled > >>> with the correct volumes. Basically, the same situation I have 5 out of > >>> seven days for about a year now, with many versions of Bacula. > >>> > >>> Jobs were scheduled as usual and got triggered as usual, too. > >>> > >>> After that, not much progress since this morning, 8:20 (almost 12 hours > >>> > >>> ago): > >>>> 12-Dec 11:21 goblin-sd: Job DracheStd.2005-12-12_08.20.00 waiting to > >>>> reserve a device. 12-Dec 11:26 goblin-sd: Job > >>>> Goblin.2005-12-12_08.25.00 > >>>> waiting to reserve a device. 12-Dec 11:26 goblin-sd: Job > >>>> Ork.2005-12-12_08.25.03 waiting to reserve a device. 12-Dec 11:26 > >>>> goblin-sd: Job ElfSys.2005-12-12_08.25.01 waiting to reserve a device. > >>>> 12-Dec 11:26 goblin-sd: Job ElfHome.2005-12-12_08.25.02 waiting to > >>>> reserve a device. 12-Dec 15:21 goblin-sd: Job > >>>> DracheStd.2005-12-12_08.20.00 waiting to reserve a device. 12-Dec > >>>> 15:26 goblin-sd: Job Goblin.2005-12-12_08.25.00 waiting to reserve a > >>>> device. 12-Dec 15:26 goblin-sd: Job Ork.2005-12-12_08.25.03 waiting to > >>>> reserve a > >>>> device. 12-Dec 15:26 goblin-sd: Job ElfSys.2005-12-12_08.25.01 > >>>> waiting to > >>>> reserve a device. 12-Dec 15:26 goblin-sd: Job > >>>> ElfHome.2005-12-12_08.25.02 > >>>> waiting to reserve a device. > >>> > >>> these are the latest messages. > >>> > >>> Or, in other words, since I upgraded to 1.38.3 beta, nothing has been > >>> backed up at all. > >>> > >>> The current director status is > >>> > >>>> #sta dir > >>>> goblin-dir Version: 1.38.3 (09 December 2005) i586-pc-linux-gnu suse > >>>> 8.1 > >>>> Daemon started 12-Dec-05 07:32, 1 Job run since started. > >>>> > >>>> Scheduled Jobs: > >>>> Level Type Pri Scheduled Name > >>>> Volume > >>>> ====================================================================== > >>>>=== > >>>> > >>>> ========== Incremental Backup 10 13-Dec-05 08:20 > >>>> DracheStd DLT-IV-0063 Incremental Backup 10 13-Dec-05 > >>>> 08:20 BeowulfStd > >>>> DLT-IV-0063 Differential Backup 10 13-Dec-05 08:25 > >>>> Goblin DLT-IV-0063 Differential Backup 10 13-Dec-05 > >>>> 08:25 ElfSys DLT-IV-0063 Differential Backup 10 > >>>> 13-Dec-05 08:25 ElfHome DLT-IV-0063 Differential > >>>> Backup 10 13-Dec-05 08:25 Ork *unknown* > >>>> Differential Backup 30 13-Dec-05 08:25 GoblinDB > >>>> DLT-IV-0063 > >>>> Differential Backup 10 13-Dec-05 08:30 BackupMail > >>>> QIC-525-0046 Full Backup 100 13-Dec-05 08:33 > >>>> BackupCatalog DAT-120-0006 Admin 999 13-Dec-05 09:00 > >>>> Shutdown > >>>> ==== > >>>> > >>>> Running Jobs: > >>>> JobId Level Name Status > >>>> ====================================================================== > >>>> 4443 Increme DracheStd.2005-12-12_08.20.00 is waiting on Storage > >>>> HPDAT > >>>> 4445 Increme Goblin.2005-12-12_08.25.00 is waiting on Storage HPDAT > >>>> 4446 Increme ElfSys.2005-12-12_08.25.01 is waiting on Storage HPDAT > >>>> 4447 Increme ElfHome.2005-12-12_08.25.02 is waiting on Storage HPDAT > >>>> 4448 Increme Ork.2005-12-12_08.25.03 is waiting on Storage HPDAT > >>>> 4449 Increme GoblinDB.2005-12-12_08.25.04 is waiting for higher > >>>> priority jobs to finish 4450 Differe BackupMail.2005-12-12_08.30.00 > >>>> is waiting on max Client jobs 4451 Full > >>>> BackupCatalog.2005-12-12_08.34.00 > >>>> is waiting execution 4452 Shutdown.2005-12-12_09.00.00 is > >>>> waiting > >>>> execution ==== > >>>> > >>>> Terminated Jobs: > >>>> JobId Level Files Bytes Status Finished Name > >>>> ====================================================================== > >>>>== > >>>> > >>>> 4434 Incr 152 5,453,504 OK 11-Dec-05 08:54 Goblin > >>>> 4435 Incr 277 323,115,310 OK 11-Dec-05 08:59 ElfSys > >>>> 4437 Incr 880 302,413,714 OK 11-Dec-05 08:59 Ork > >>>> 4439 Diff 872 17,081,058 OK 11-Dec-05 09:01 > >>>> BackupMail 4432 Incr 1,082 1,506,827,769 OK 11-Dec-05 > >>>> 09:47 > >>>> DracheStd 4438 Incr 2 755,036 OK 11-Dec-05 > >>>> 09:47 GoblinDB 4440 Full 16 1,704,566,707 OK 11-Dec-05 > >>>> 10:31 BackupCatalog 4441 0 0 OK > >>>> 11-Dec-05 10:37 Shutdown 4442 0 0 OK > >>>> 11-Dec-05 23:58 Shutdown 4444 Incr 0 0 Error > >>>> 12-Dec-05 08:20 BeowulfStd > >>>> > >>>> ==== > >>> > >>> The last job mentioned failed because the run before script I use found > >>> the host to be down and exited with a non-zero status. This is expected > >>> and usual behaviour in case of the windows workstations I back up. > >>> > >>> The SD status: > >>>> #setdebug level=200 sd=HPDAT > >>>> Connecting to Storage daemon HPDAT at goblin:9103 > >>>> 3000 OK setdebug=200 > >>>> #sta sd=HPDAT > >>>> Connecting to Storage daemon HPDAT at goblin:9103 > >>>> > >>>> goblin-sd Version: 1.38.3 (09 December 2005) i586-pc-linux-gnu suse > >>>> 8.1 Daemon started 12-Dec-05 07:32, 0 Jobs run since started. > >>>> Heap: bytes=50,665 max_bytes=116,676 bufs=241 max_bufs=249 > >>>> > >>>> Running Jobs: > >>>> No Jobs running. > >>>> ==== > >>>> > >>>> Terminated Jobs: > >>>> JobId Level Files Bytes Status Finished Name > >>>> ====================================================================== > >>>> 4422 Incr 0 0 OK 10-Dec-05 10:31 > >>>> DracheStd > >>>> 4430 Full 16 1,704,173,700 OK 10-Dec-05 11:07 > >>>> BackupCatalog 4436 Incr 6 81,815,937 OK 11-Dec-05 > >>>> 08:54 ElfHome 4434 Incr 152 5,467,497 OK 11-Dec-05 > >>>> 08:54 Goblin 4435 Incr 277 323,149,789 OK 11-Dec-05 > >>>> 08:59 ElfSys 4437 Incr 880 302,514,999 OK 11-Dec-05 > >>>> 08:59 Ork 4439 Diff 872 17,180,372 OK 11-Dec-05 > >>>> 09:01 BackupMail 4432 Incr 1,082 1,507,075,010 OK > >>>> 11-Dec-05 09:47 > >>>> DracheStd 4438 Incr 2 755,274 OK 11-Dec-05 > >>>> 09:47 GoblinDB 4440 Full 16 1,704,568,641 OK 11-Dec-05 > >>>> 10:31 BackupCatalog ==== > >>>> > >>>> Device status: > >>>> Autochanger "HP DAT" with devices: > >>>> "HP DAT 0" (/dev/nst1) > >>>> Autochanger "FastStor 4000" with devices: > >>>> "Quantum DLT 0" (/dev/nst2) > >>>> Device "GoblinFile" (/var/bacula/) is not open or does not exist. > >>>> Configured device capabilities: > >>>> EOF BSR BSF FSR FSF EOM !REM RACCESS AUTOMOUNT LABEL !ANONVOLS > >>>> !ALWAYSOPEN Device state: > >>>> !OPENED !TAPE !LABEL MALLOC !APPEND !READ !EOT !WEOT !EOF !NEXTVOL > >>>> !SHORT > >>>> !MOUNTED num_writers=0 JobStatus=C block=0 > >>>> > >>>> Device parameters: > >>>> Archive name: /var/bacula/ Device name: GoblinFile > >>>> File=0 block=0 > >>>> Min block=0 Max block=0 > >>>> Device "Wangtek 6525" (/dev/nst0) is mounted with Volume > >>>> "QIC-525-0046" Configured device capabilities: > >>>> EOF BSR BSF FSR FSF EOM REM !RACCESS AUTOMOUNT !LABEL !ANONVOLS > >>>> ALWAYSOPEN Device state: > >>>> OPENED TAPE LABEL MALLOC !APPEND !READ !EOT !WEOT !EOF !NEXTVOL !SHORT > >>>> !MOUNTED num_writers=0 JobStatus=C block=0 > >>>> > >>>> Device parameters: > >>>> Archive name: /dev/nst0 Device name: Wangtek 6525 > >>>> File=0 block=0 > >>>> Min block=1024 Max block=1024 > >>>> Total Bytes Read=0 Blocks Read=0 Bytes/block=0 > >>>> Positioned at File=0 Block=0 > >>>> Device "HP DAT 0" (/dev/nst1) open but no Bacula volume is mounted. > >>>> Drive 0 is not loaded. > >>>> Configured device capabilities: > >>>> EOF BSR BSF FSR FSF EOM REM !RACCESS AUTOMOUNT !LABEL !ANONVOLS > >>>> ALWAYSOPEN Device state: > >>>> OPENED TAPE !LABEL MALLOC !APPEND !READ !EOT !WEOT !EOF !NEXTVOL > >>>> !SHORT !MOUNTED num_writers=0 JobStatus=C block=0 > >>>> > >>>> Device parameters: > >>>> Archive name: /dev/nst1 Device name: HP DAT 0 > >>>> File=0 block=0 > >>>> Min block=0 Max block=0 > >>>> Total Bytes Read=0 Blocks Read=0 Bytes/block=0 > >>>> Positioned at File=0 Block=0 > >>>> Device "Quantum DLT 0" (/dev/nst2) open but no Bacula volume is > >>>> mounted. > >>>> Drive 0 is not loaded. > >>>> Configured device capabilities: > >>>> EOF BSR BSF FSR FSF EOM REM !RACCESS AUTOMOUNT !LABEL !ANONVOLS > >>>> ALWAYSOPEN Device state: > >>>> OPENED TAPE !LABEL MALLOC !APPEND !READ !EOT !WEOT !EOF !NEXTVOL > >>>> !SHORT !MOUNTED num_writers=0 JobStatus=C block=0 > >>>> > >>>> Device parameters: > >>>> Archive name: /dev/nst2 Device name: Quantum DLT 0 > >>>> File=0 block=0 > >>>> Min block=0 Max block=0 > >>>> Total Bytes Read=0 Blocks Read=0 Bytes/block=0 > >>>> Positioned at File=0 Block=0 > >>>> ==== > >>>> > >>>> In Use Volume status: > >>>> QIC-525-0046 on device "Wangtek 6525" (/dev/nst0) > >>>> ==== > >>> > >>> In other words, the SD seems to be sure it doesn't have to do anything. > >>> > >>> This morning, when I noticed that behaviour, I used mtx to load a tape > >>> > >>> into drive HPDAT and issued the proper mount command. I do the same now: > >>>> #mount HPDAT > >>>> 3905 Device "HP DAT 0" (/dev/nst1) open but no Bacula volume is > >>>> mounted. > >>>> If this is not a blank tape, try unmounting and remounting the Volume. > >>>> # > >>> > >>> This is hardly true - that same tape was ok up till yesterday. Apart > >>> > >>> from that, the SD doesn't even try to read the tape: > >>>> goblin:~ # tapeinfo -f /dev/sg4 > >>>> Product Type: Tape Drive > >>>> Vendor ID: 'HP ' > >>>> Product ID: 'C1553A ' > >>>> Revision: '9503' > >>>> Attached Changer: No > >>>> MinBlock:1 > >>>> MaxBlock:16777215 > >>>> SCSI ID: 5 > >>>> SCSI LUN: 0 > >>>> Ready: yes > >>>> BufferedMode: yes > >>>> Medium Type: Not Loaded > >>>> Density Code: 0x24 > >>>> BlockSize: 0 > >>>> DataCompEnabled: yes > >>>> DataCompCapable: yes > >>>> DataDeCompEnabled: yes > >>>> CompType: 0x20 > >>>> DeCompType: 0x0 > >>>> BOP: yes > >>>> Block Position: 0 > >>>> goblin:~ # > >>> > >>> Or, in other words, the tape which is loaded is still at BOT which > >>> indicates that the SD didn't actually try to read it. > >>> > >>> The debug log is filled with many thousand lines mostly created by > >>> reserve.c in case of the mount command. > >>> > >>> Any clues? > >>> > >>> Arno > >>> > >>>>> Also, if your problem originated from failed jobs where the Dir was > >>>>> unable to contact a client, then it was most likely related to bug # > >>>>> 508, which is now fixed. > >>>> > >>>> No, I don't think that's the problem. > >>>> > >>>>> Oh, yes, I have now corrected the documentation for 1.38.3 for the > >>>>> Maximum Changer Wait specification. > >>>> > >>>> Great. Thanks. > >>>> > >>>> Arno ------------------------------------------------------- This SF.net email is sponsored by: Splunk Inc. Do you grep through log files for problems? Stop! Download the new AJAX search engine that makes searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! http://ads.osdn.com/?ad_idv37&alloc_id865&op=click _______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users