On 1/28/14 4:22 AM, "Wolfgang Denk" <w...@denx.de> wrote:
>Hello,
>
>For some time I have been facing the situation that 2 specific backup
>jobs (out of a total of some 30+) fail every now and then with the
>error message:
>
>    Fatal error: askdir.c:340 NULL Volume name. This shouldn't happen!!!
>
>Re-running the jobs works fine.
>
>Backup is done to an LTO library with 2 tape drives. Here is an example
>log that shows what is happening:
>
>1) I start two simultaneous backup jobs.
>
>   *run level=Incremental pool=INCR Hercules-Root
>   Run Backup job
>   JobName:  Hercules-Root
>   Level:    Incremental
>   Client:   hercules-fd
>   FileSet:  Hercules Root
>   Pool:     INCR (From User input)
>   Storage:  LTOLIB (From Job resource)
>   When:     2014-01-28 09:41:11
>   Priority: 10
>   OK to run? (yes/mod/no): yes
>   Job queued. JobId=71138
>
>   *run level=Incremental pool=INCR Pollux-Root
>   Run Backup job
>   JobName:  Pollux-Root
>   Level:    Incremental
>   Client:   pollux-fd
>   FileSet:  Pollux Root
>   Pool:     INCR (From User input)
>   Storage:  LTOLIB (From Job resource)
>   When:     2014-01-28 09:41:19
>   Priority: 10
>   OK to run? (yes/mod/no): yes
>   Job queued. JobId=71139
>
>2) The DIR decides to allocate each job to one of the two available
>   tape drives. At this point, the following tapes are loaded in the
>   tape drives:
>
>      drive 0: slot 26
>      drive 1: slot 32
>
>   Both tapes are in the correct pool (INCR) and have status "Append".
>
>   JobId 71138 will use drive 0,
>   JobId 71139 will use drive 1.
>
>   28-Jan 09:41 mneme-dir JobId 71138: Start Backup JobId 71138, Job=Hercules-Root.2014-01-28_09.41.13_25
>   28-Jan 09:41 mneme-dir JobId 71138: Using Device "LTO3-0" to write.
>   28-Jan 09:41 ltos-sd JobId 71138: 3307 Issuing autochanger "unload slot 32, drive 1" command.
>   28-Jan 09:41 mneme-dir JobId 71139: Start Backup JobId 71139, Job=Pollux-Root.2014-01-28_09.41.21_26
>   28-Jan 09:41 mneme-dir JobId 71139: Using Device "LTO3-1" to write.
>   28-Jan 09:43 ltos-sd JobId 71139: Fatal error: askdir.c:340 NULL Volume name. This shouldn't happen!!!
>   28-Jan 09:43 ltos-sd JobId 71139: Spooling data ...
>   28-Jan 09:44 ltos-sd JobId 71139: Elapsed time=00:01:23, Transfer rate=0 Bytes/second
>
>3) JobId 71139 fails with the "NULL Volume name" error. Note that, just
>   before this, the tape originally loaded in this drive had been
>   unloaded.
>
>   28-Jan 09:44 pollux-fd JobId 71139: Error: bsock.c:429 Write error sending 8 bytes to Storage daemon:ltos.denx.de:9103: ERR=Connection reset by peer
>   28-Jan 09:44 pollux-fd JobId 71139: Fatal error: xattr.c:98 Network send error to SD. ERR=Connection reset by peer
>   28-Jan 09:44 mneme-dir JobId 71139: Error: Bacula mneme-dir 5.2.13 (19Jan13):
>     Build OS:               x86_64-redhat-linux-gnu redhat
>     JobId:                  71139
>     Job:                    Pollux-Root.2014-01-28_09.41.21_26
>     Backup Level:           Incremental, since=2014-01-28 00:06:14
>     Client:                 "pollux-fd" 5.2.13 (19Jan13) x86_64-redhat-linux-gnu,redhat,Cat)
>     FileSet:                "Pollux Root" 2005-12-22 11:06:26
>     Pool:                   "INCR" (From User input)
>     Catalog:                "MyCatalog" (From Client resource)
>     Storage:                "LTOLIB" (From Job resource)
>     Scheduled time:         28-Jan-2014 09:41:19
>     Start time:             28-Jan-2014 09:41:23
>     End time:               28-Jan-2014 09:44:47
>     Elapsed time:           3 mins 24 secs
>     Priority:               10
>     FD Files Written:       2
>     SD Files Written:       0
>     FD Bytes Written:       0 (0 B)
>     SD Bytes Written:       0 (0 B)
>     Rate:                   0.0 KB/s
>     Software Compression:   None
>     VSS:                    no
>     Encryption:             no
>     Accurate:               no
>     Volume name(s):
>     Volume Session Id:      275
>     Volume Session Time:    1390085323
>     Last Volume Bytes:      16,678,095,872 (16.67 GB)
>     Non-fatal FD errors:    1
>     SD Errors:              1
>     FD termination status:  Error
>     SD termination status:  Error
>     Termination:            *** Backup Error ***
>
>4) The other (first started) job now also unloads its tape, and then
>   loads the one previously unloaded from the other drive:
>
>   28-Jan 09:44 ltos-sd JobId 71138: 3307 Issuing autochanger "unload slot 26, drive 0" command.
>   28-Jan 09:44 ltos-sd JobId 71138: 3304 Issuing autochanger "load slot 32, drive 0" command.
>   28-Jan 09:45 ltos-sd JobId 71138: 3305 Autochanger "load slot 32, drive 0", status is OK.
>   28-Jan 09:45 ltos-sd JobId 71138: Volume "INC007L3" previously written, moving to end of data.
>   28-Jan 09:47 ltos-sd JobId 71138: Ready to append to end of Volume "INC007L3" at file=6.
>   28-Jan 09:47 ltos-sd JobId 71138: Spooling data ...
>
>To me it appears as if the DIR does not correctly take into account
>which tape is loaded where. It sees two tape drives and assigns one of
>them to the first starting job, but then decides not to use the
>currently mounted tape (which would be perfectly fine by all criteria
>like Pool, Status, Use Days etc., and which will actually be loaded
>into the other drive later to run more jobs), but rather the one which
>is currently loaded in the _other_ drive. So it starts by unloading
>the other drive.
>
>Now the second job starts running, finds that someone pulled the tape
>out from under it, and fails.
>
>Does my interpretation make sense?
>
>Is this a common problem, or am I doing something wrong?
>
>The "interesting" thing is that it's always the same 2 jobs out of my
>list which are candidates for this error. And it does not always
>happen - maybe 2 times per week or so...
>
>All this is with 5.2.13 on Fedora 20 systems...
>
>All help / ideas welcome. Thanks in advance.
>
>Best regards,
>
>Wolfgang Denk
>
>--
>DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel
>HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
>Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: w...@denx.de
>The following statement is not true. The previous statement is true.

I posted a similar notice on 10/31/2013. Beyond getting no responses, there is no bug report, which means no fix is pending. Signing up to file a bug report is also broken, as has been mentioned on this list: no message is sent that allows for verifying and setting up an account.
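For reference, the drive/volume choice Wolfgang describes is influenced by the Job resource directive "Prefer Mounted Volumes" (default yes) in bacula-dir.conf. A minimal sketch of the relevant Job resource, using the names visible in the log above; everything not shown in the thread (e.g. the Messages resource) is illustrative, and whether changing this directive avoids the race is untested:

    # bacula-dir.conf -- sketch only; names not in the thread are assumptions
    Job {
      Name = "Hercules-Root"
      Type = Backup
      Client = hercules-fd
      FileSet = "Hercules Root"
      Storage = LTOLIB
      Pool = INCR
      Messages = Standard              # assumed
      # Default is yes: the Director asks the SD to prefer a drive that
      # already has a suitable Volume mounted.  The log above suggests
      # this selection can race when two jobs start simultaneously.
      Prefer Mounted Volumes = yes
    }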
I can suggest a workaround that reduces the incidence of this tape stealing by the storage director from one job for another: set the job schedules with an offset. I have used a 15-minute offset for jobs that had been starting at the same time. I still see the "NULL Volume name" error, but with less frequency; I just look for it and rerun the affected jobs.

My current biggest issue is that the storage daemon doesn't notice the job failure and release the tape resource unless I restart the Bacula processes. There appears to be a bug report and a fix pending release for that issue.

Patti Clark
Linux System Administrator
Oak Ridge National Laboratory

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
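The scheduling offset Patti describes is expressed in the Director's Schedule resources. A sketch assuming two nightly incremental jobs; the schedule names and times are illustrative, not taken from the thread:

    # bacula-dir.conf -- illustrative only; names and times are assumptions.
    # Two schedules 15 minutes apart so the two jobs no longer start together.
    Schedule {
      Name = "Nightly-2300"
      Run = Level=Incremental daily at 23:00
    }
    Schedule {
      Name = "Nightly-2315"
      Run = Level=Incremental daily at 23:15
    }

Each Job resource then references one of the two schedules (Schedule = "Nightly-2300" in one job, Schedule = "Nightly-2315" in the other), so the drive/volume selection for the second job happens after the first has its tape positioned.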