On Tuesday 26 July 2005 04:10, Theron Toomey wrote:
Hi, thanks for the suggestions. Sorry it took me a few days to respond;
there's not much time for testing between daily backup cycles.
My current theory is that there is some strange corruption in my DB,
perhaps in the File table.
I'm not sure but I think this may be related to another problem I'm
having. Restores (using option 3 or 5) of very large jobs (around 200
GB) fail while writing the restore bootstrap. I suspect that while
reading my catalog to generate the restore.bsr, bacula is encountering
some corruption, which may also explain the strange garbage in my other
restores.
Could you send me the console output and any other output from this "fail"
so I can see what is going on?
This isn't necessarily pertinent, but I have seen a couple of interesting
results with these large restores, varying from the SD segfaulting
immediately to sitting in an infinite loop, eating about half the system
memory, and then segfaulting (no, it's not using the tls lib). Here's a
gdb trace of the latter behavior if you are curious:
http://www.duke.edu/~ttoomey/misc/bacula-sd-debug.3.txt.gz
Could you send me the bootstrap file from this? When doing the restore
and it reaches the question yes/mod/no, it will have printed the location
of the bootstrap file just prior to issuing the prompt. Before answering
the prompt, you can copy it to another location (after answering the
prompt, it usually deletes the file).
After running dbcheck, it did cough up an error while restoring before
the SD died:
25-Jul 11:05 fury: restore.2005-07-25_11.03.07 Fatal error: Bootstrap
file error: expected an integer or a range, got T_EOL: =
: Line 5543394, col 10 of file
/var/bacula/fury.restore.2005-07-25_11.03.07.bootstrap
FileIndex=
25-Jul 11:05 fury: restore.2005-07-25_11.03.07 Fatal error: job.c:1662
Comm error with SD. bad response to Bootstrap. ERR=No data available
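The parser error above points at a "FileIndex=" line with nothing after the
equals sign. A minimal sketch for locating every such line in a saved copy of
the bootstrap file follows. It assumes the file holds one Keyword=value pair
per line; the function names are hypothetical and are not part of bacula.

```python
import re
from typing import Iterable, List, Tuple

# Matches a line of the form "Keyword=" with nothing (or only whitespace)
# after the equals sign. Assumption: one Keyword=value pair per line.
EMPTY_VALUE = re.compile(r"^\s*(\w+)\s*=\s*$")


def find_empty_values(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, keyword) for every 'Keyword=' line with no value."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = EMPTY_VALUE.match(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits


def scan_bootstrap(path: str) -> List[Tuple[int, str]]:
    """Scan a saved copy of a bootstrap file, reading it line by line."""
    # errors="replace" keeps the scan going if the file contains odd bytes.
    with open(path, "r", errors="replace") as fh:
        return find_empty_values(fh)
```

Calling scan_bootstrap() on the copied bootstrap file returns the line numbers
to inspect. The file is read one line at a time, so a bootstrap with several
million lines should not need to fit in memory.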
I plan on filing a bug about the SD issue after I do some more testing
to try and isolate the problem. I think, whatever the corruption is, it
should probably be handled more gracefully by the SD (if my theory is
right).
Yes, please do open a bug report -- preferably one for each problem that
you consider unrelated. I am unable to adequately track and resolve
complicated problems such as this from emails.
Err.. anyway, please see below for my answers.
Martin Simmons wrote:
Theron> Hello,
Theron> I'm seeing some strange behavior with restores under 1.36.3/RHEL 3 using
Theron> an AIT-3 drive. I'm not quite sure what is causing it and I'd really
Theron> appreciate any suggestions.
Theron> When I choose restore option 5 (Select the most recent backup) bacula
Theron> proceeds to restore data from the last full and subsequent diff/incr
Theron> jobs. However, for large restores (>50 GB), I notice a few dozen error
Theron> messages like:
Theron> Error: attribs.c:339 File size of restored file /foo/bar not correct.
Theron> Original [file size], restored [large, bogus file size].
Theron> Comparing the restores against the live data, I see that the restored
Theron> files have lots of random garbage inserted/appended to them.
Theron> However, when I manually find the jobIDs of the full/diffs/incrs and
Theron> restore them individually with restore option 3, there is no corruption
Theron> and the files all seem fine.
Does "individually" mean one at a time, i.e. repeated use of option 3?
If so, do you get corruption if you enter all the jobIDs into a single
option 3 in the same order as bacula chose from option 5?
Yes, individually means one at a time with repeated use of option 3. If
I enter the same JobIDs from option 5 into option 3, I see exactly the
same corruption on the same files as when I use option 5.
Theron> Most of the corrupt files are older than the last full; Perhaps there's
Theron> something in the diff/incr jobs that corrupts the files from the full
Theron> job. However, most of the corrupt files are older than the last full and
Theron> so are not even present in the diff/incr jobs.
Theron> Has anyone seen behavior like this or have any ideas about where to
Theron> look?
For a particular restore, is it always the same files that are
corrupted? If yes, is the garbage really random or is it the same
garbage each time? Also, what happens if you use option 5 but only
mark one of the corrupted files for restore?
If I perform two identical restores, I see the same files corrupted with
the same garbage. The md5sums of the corresponding files from each
restore match, so the garbage isn't random, or at least it's
dependent on something else.
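A comparison like this can be scripted. The following is only a sketch: it
assumes the two restores (or one restore and the live data) sit in two
directory trees with the same relative layout, and the function names are
hypothetical.

```python
import hashlib
import os
from typing import Dict, List


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_tree(root: str) -> Dict[str, str]:
    """Map each regular file under root (by relative path) to its MD5 digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip broken symlinks, FIFOs and other non-regular entries.
            if os.path.isfile(full):
                sums[os.path.relpath(full, root)] = md5_of_file(full)
    return sums


def differing_files(root_a: str, root_b: str) -> List[str]:
    """List relative paths whose digests differ or that exist in one tree only."""
    sums_a, sums_b = md5_tree(root_a), md5_tree(root_b)
    all_paths = sums_a.keys() | sums_b.keys()
    return sorted(p for p in all_paths if sums_a.get(p) != sums_b.get(p))
```

Running differing_files() on two identical restores should give an empty list
if the garbage is reproducible; running it on a restore and the live data
lists the files to look at, along with any file changed since the backup.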
If I mark just one of the corrupted files for restore, there is no
corruption in the file.
Can you send me the corresponding bootstrap files for these two cases so
that I can compare them?
What database are you using, and can you give me an idea how big it is?
Thanks for your help, I hope I'm on the right track.
Well, you are certainly doing the right things, but it is a bit early to
tell what the right track really is ...