Stefan and Arno,
Thanks for your replies, and for pointing out that the recovery procedure is
described in the manual. I had not spotted that.
Alex
Arno Lehmann wrote:
Hello,
On 1/25/2006 5:36 PM, Alex Finch wrote:
I have spent the last few days setting up bacula. Everything was
going fine till this afternoon. I was backing up a user's laptop when
he turned it off. The next backup failed saying:
25-Jan 16:26 lapf-sd: Andres_Sopczaks_Laptop.2006-01-25_16.23.41
Error: I cannot write on Volume "LAN130" because:
The number of files mismatch! Volume=249 Catalog=248
25-Jan 16:26 lapf-sd: Marking Volume "LAN130" in Error in Catalog.
The previous backup ended thus:
25-Jan 14:18 lapf-dir: Roger_Jones_Laptop.2006-01-25_11.42.01 Fatal
error: Network error with FD during Backup: ERR=Connection timed out
25-Jan 14:18 lapf-dir: Roger_Jones_Laptop.2006-01-25_11.42.01 Fatal
error: No Job status returned from FD.
25-Jan 14:18 lapf-dir: Roger_Jones_Laptop.2006-01-25_11.42.01 Error:
Bacula 1.38.5 (18Jan06): 25-Jan-2006 14:18:47
JobId: 45
Job: Roger_Jones_Laptop.2006-01-25_11.42.01
Backup Level: Full (upgraded from Incremental)
Client: "pyb047000004-fd" Windows XP,MVS,NT 5.1.2600
FileSet: "Roger Jones Laptop" 2006-01-25 11:42:03
Pool: "Default"
Storage: "SONY Library"
Scheduled time: 25-Jan-2006 11:41:51
Start time: 25-Jan-2006 11:42:03
End time: 25-Jan-2006 14:18:47
Priority: 10
FD Files Written: 0
SD Files Written: 0
FD Bytes Written: 0
SD Bytes Written: 0
Rate: 0.0 KB/s
Software Compression: None
Volume name(s): LAN130
Volume Session Id: 1
Volume Session Time: 1138188988
Last Volume Bytes: 236,723,950,457
Non-fatal FD errors: 0
SD Errors: 0
FD termination status: Error
SD termination status: Running
Termination: *** Backup Error ***
=====================================================================================================================================================
Can I
a) recover the situation?
Hmm. Difficult question, because, in my opinion, the above should not
have happened. Basically, though, there's not much to recover.
The actual problem is that Bacula has a different idea about how many
file marks are on a volume and thus can't trust itself to position to
the right tape position.
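To make the check concrete, here is a minimal sketch of the comparison that produces the message above. It is not Bacula's actual code; the function name and arguments are made up for illustration, and only the message wording is taken from the log.

```python
def check_volume(volume_name, files_on_volume, files_in_catalog):
    """Return None if both counts agree, otherwise an error message.

    Hypothetical helper: it only mimics the comparison the SD reports,
    it is not taken from the Bacula source.
    """
    if files_on_volume == files_in_catalog:
        return None
    return (
        f'I cannot write on Volume "{volume_name}" because: '
        f"The number of files mismatch! "
        f"Volume={files_on_volume} Catalog={files_in_catalog}"
    )


# Values from the log above: the tape has one more file mark than the catalog.
print(check_volume("LAN130", 249, 248))
```

With equal counts the function returns None and the volume would be accepted for appending.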
This usually only happens (or should happen) in cases where your drive
seriously fails during writing, a tape fails during writing (in which
case that doesn't matter anymore), the SD crashes while a job is active,
the DIR crashes while a job is active, the database or the connection to
it fails, and so on. In other words, at the moment, I'd say you
found a bug.
Usually, if a job is aborted with an error while it's running, the SD
and the DIR work together correctly, so that the SD finishes writing
data to the volume and writes the necessary file mark, while the DIR
notices that fact in the catalog. So, usually the number of files on the
volume is correctly noted in the catalog.
Now, how to recover?
You've got several possible solutions:
- Simply set the tape status to Used. It will be recycled as planned,
and you will only lose part of its capacity. Usually nothing serious.
- Leave it in state Error. That tape would never be rewritten, and after
some time you'd wonder why it's marked as defective, and you would probably
destroy a perfect tape. I wouldn't do that.
- Modify the catalog data to the correct number of files. The above
messages indicate that the SD wrote the necessary file mark, but the
catalog was not updated. So, if you know a little SQL and know a little
about Bacula's database schema, that's a simple task. You should know
what you are doing, though. Afterwards, you could set the volume status to
Append and it *should* be usable without any problems. You wouldn't lose
any tape space. I would only do this if I'm short of tapes.
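As a rough sketch of the third option, the following runs against a toy in-memory SQLite table, not a real catalog. The Media table here is cut down to three columns (VolumeName, VolFiles, VolStatus), and the helper name is made up. It also sets the status to Append in the same statement, which is a simplification of the two steps described above. Back up the catalog before trying anything like this on a live system.

```python
import sqlite3


def repair_volfiles(conn, volume_name, files_on_tape):
    """Set VolFiles to the count found on tape and reopen the volume.

    Hypothetical helper. It only touches a volume that is currently
    marked 'Error', and returns the number of rows changed.
    """
    cur = conn.execute(
        "UPDATE Media SET VolFiles = ?, VolStatus = 'Append' "
        "WHERE VolumeName = ? AND VolStatus = 'Error'",
        (files_on_tape, volume_name),
    )
    conn.commit()
    return cur.rowcount


# Toy stand-in for the catalog: one volume, in the state from the log.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Media ("
    "VolumeName TEXT PRIMARY KEY, VolFiles INTEGER, VolStatus TEXT)"
)
conn.execute("INSERT INTO Media VALUES ('LAN130', 248, 'Error')")

changed = repair_volfiles(conn, "LAN130", 249)
row = conn.execute(
    "SELECT VolFiles, VolStatus FROM Media WHERE VolumeName = 'LAN130'"
).fetchone()
print(changed, row)
```

The restriction to VolStatus = 'Error' is deliberate: a volume in any other state is left alone, so running the fix twice does no harm.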
b) prevent it happening again?
Not much to do. In fact, I'd check whether you can reproduce that behaviour.
Perhaps set up some test jobs, let Bacula run with debugging output
turned on (both the SD and the DIR) and break the test jobs in different
ways: disconnect the network between FD and SD or kill the FD for
example. See what happens.
If this can be reproduced I'd say it's a bug to fix.
Why? Because I have jobs that end in error on a regular basis - I back
up one WLAN-connected notebook, and once in a while, that connection is
dropped. The result is that the SD times out the job because it can't
connect to the FD any more. BUT, and that is probably one big
difference, I use spooling, so that the SD only starts writing to tape
when all the data is available. And I use 1.38.4.
I'll see if I get 1.38.5 installed tomorrow and set up a test job
without spooling...
Is it a bug or a feature?
Definitely not a feature, I'd say.
Arno
Alex Finch
--
Alex Finch, Research Fellow, Physics Department, Lancaster University.