Stefan and Arno,

Thanks for your replies, and for pointing out that the recovery procedure is 
described in the manual. I had not spotted that.

    Alex

Arno Lehmann wrote:
Hello,

On 1/25/2006 5:36 PM, Alex Finch wrote:


I have spent the last few days setting up bacula. Everything was going fine till this afternoon. I was backing up a user's laptop when he turned it off. The next backup failed saying:

25-Jan 16:26 lapf-sd: Andres_Sopczaks_Laptop.2006-01-25_16.23.41 Error: I cannot write on Volume "LAN130" because:
The number of files mismatch! Volume=249 Catalog=248
25-Jan 16:26 lapf-sd: Marking Volume "LAN130" in Error in Catalog.


The previous backup ended thus:

25-Jan 14:18 lapf-dir: Roger_Jones_Laptop.2006-01-25_11.42.01 Fatal error: Network error with FD during Backup: ERR=Connection timed out
25-Jan 14:18 lapf-dir: Roger_Jones_Laptop.2006-01-25_11.42.01 Fatal error: No Job status returned from FD.
25-Jan 14:18 lapf-dir: Roger_Jones_Laptop.2006-01-25_11.42.01 Error: Bacula 1.38.5 (18Jan06): 25-Jan-2006 14:18:47
  JobId:                  45
  Job:                    Roger_Jones_Laptop.2006-01-25_11.42.01
  Backup Level:           Full (upgraded from Incremental)
  Client:                 "pyb047000004-fd" Windows XP,MVS,NT 5.1.2600
  FileSet:                "Roger Jones Laptop" 2006-01-25 11:42:03
  Pool:                   "Default"
  Storage:                "SONY Library"
  Scheduled time:         25-Jan-2006 11:41:51
  Start time:             25-Jan-2006 11:42:03
  End time:               25-Jan-2006 14:18:47
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0
  SD Bytes Written:       0
  Rate:                   0.0 KB/s
  Software Compression:   None
  Volume name(s):         LAN130
  Volume Session Id:      1
  Volume Session Time:    1138188988
  Last Volume Bytes:      236,723,950,457
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Running
  Termination:            *** Backup Error ***

=====================================================================================================================================================

 Can I

 a) recover the situation?


Hmm. Difficult question, because, in my opinion, the above should not have happened. Basically, though, there's not much to recover.

The actual problem is that Bacula has a different idea about how many file marks are on a volume and thus can't trust itself to position to the right tape position.
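
To make the mismatch concrete, here is a small Python sketch that pulls the two counts out of the SD message quoted above. The function name and the regular expression are my own illustration, not part of Bacula, and the message wording may differ in other Bacula versions.

```python
import re

# Matches the "Volume=249 Catalog=248" part of the SD message quoted above.
# Assumption: other Bacula versions may word this message differently.
MISMATCH_RE = re.compile(r"Volume=(\d+)\s+Catalog=(\d+)")


def parse_file_mismatch(message):
    """Return (volume_files, catalog_files) from the message, or None if absent."""
    match = MISMATCH_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = "The number of files mismatch! Volume=249 Catalog=248"
volume_files, catalog_files = parse_file_mismatch(line)
# A positive difference means the tape holds more file marks than the catalog knows about.
print(volume_files - catalog_files)
```

In this case the tape has one more file mark than the catalog recorded, which fits the picture of a job that was cut off before the catalog was updated.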

This usually only happens (or should happen) in cases where your drive seriously fails during writing, a tape fails during writing (in which case that doesn't matter anymore), the SD crashes while a job is active, the DIR crashes while a job is active, the database or database connectivity fails, and so on. In other words, at the moment, I'd say you found a bug.

Usually, if a job is aborted with an error while it's running, the SD and the DIR work together correctly, so that the SD finishes writing data to the volume and writes the necessary file mark, while the DIR records that fact in the catalog. So, usually the number of files on the volume is correctly noted in the catalog.

Now, how to recover?
You've got several possible solutions:
- Simply set the tape status to Used. It will be recycled as planned, and you will only lose part of its capacity. Usually nothing serious.
- Leave it in state Error. That tape would never be rewritten, and after some time you'd wonder why it's marked as defective, and you would probably destroy a perfectly good tape. I wouldn't do that.
- Modify the catalog data to the correct number of files. The above messages indicate that the SD wrote the necessary file mark, but the catalog was not updated. So, if you know a little SQL and a little about Bacula's database schema, that's a simple task. You should know what you are doing, though. Afterwards, you could set the volume status to Append and it *should* be usable without any problems. You wouldn't lose any tape space. I would only do this if I were short of tapes.
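
As a rough illustration of the third option, the Python sketch below applies the correction to a stand-in SQLite database. It assumes the catalog keeps volumes in a table named Media with columns VolumeName, VolFiles and VolStatus; check those names against the schema of your Bacula version before trying anything similar. The helper function is hypothetical, a real catalog has many more columns and may live in MySQL or PostgreSQL, and a catalog backup beforehand would be prudent.

```python
import sqlite3


def fix_volume_file_count(conn, volume_name, volume_files):
    """Set VolFiles to the count the SD reported and mark the volume Append.

    Assumption: table and column names (Media, VolumeName, VolFiles,
    VolStatus) match the catalog schema of the Bacula version in use.
    Returns the number of rows changed.
    """
    cur = conn.execute(
        "UPDATE Media SET VolFiles = ?, VolStatus = 'Append' WHERE VolumeName = ?",
        (volume_files, volume_name),
    )
    conn.commit()
    return cur.rowcount


# Stand-in catalog with only the columns this sketch touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Media (VolumeName TEXT PRIMARY KEY, VolFiles INTEGER, VolStatus TEXT)"
)
conn.execute("INSERT INTO Media VALUES ('LAN130', 248, 'Error')")

changed = fix_volume_file_count(conn, "LAN130", 249)
row = conn.execute(
    "SELECT VolFiles, VolStatus FROM Media WHERE VolumeName = 'LAN130'"
).fetchone()
print(changed, row)
```

The sketch changes the file count and the status in one statement only to keep it short; the status could just as well be changed separately from the console once the count is right.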

 b) prevent it happening again?


Not much to do. In fact, I'd check whether you can reproduce that behaviour. Perhaps set up some test jobs, let Bacula run with debugging output turned on (both the SD and the DIR) and break the test jobs in different ways: disconnect the network between FD and SD, or kill the FD, for example. See what happens.

If this can be reproduced I'd say it's a bug to fix.

Why? Because I have jobs that end in error on a regular basis - I back up one WLAN-connected notebook, and once in a while, that connection is dropped. The result is that the SD times out the job because it can't connect to the FD any more. BUT, and that is probably one big difference, I use spooling, so that the SD only starts writing to tape when all the data is available. And I use 1.38.4.

I'll see if I get 1.38.5 installed tomorrow and set up a test job without spooling...

 Is it a bug or a feature?


Definitely not a feature, I'd say.

Arno

        Alex Finch




--
 Alex Finch, Research Fellow, Physics Department, Lancaster University.


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
