Re: [Bacula-users] corrupt files on full restore

Kern Sibbald Fri, 05 Aug 2005 13:48:05 -0700

On Friday 05 August 2005 21:29, Theron Toomey wrote:
> Thanks Kern,
> Changing the names of the daemons solved the problem. Given the strange
> behavior, it wouldn't have occurred to me that was the cause but it
> makes perfect sense.


Thanks for the feedback. It is always nice to have a confirmation.

Yes, the downside of having the same names isn't so obvious. I've updated the 
doc to have an explicit warning about it.
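
For anyone who wants to check their own setup for this, here is a minimal Python sketch, not part of Bacula. It assumes each daemon describes itself in a flat Director, Storage, or FileDaemon resource, and it uses a simple regular expression rather than a real parser for the configuration syntax. The function names and the sample name "backup1" are hypothetical.

```python
import re
from collections import defaultdict

# Resource that describes the daemon itself in each daemon's config file.
# Assumption: the resource body contains no nested braces.
DAEMON_RESOURCE = {"dir": "Director", "sd": "Storage", "fd": "FileDaemon"}


def daemon_identity(kind, conf_text):
    """Return (name, working_directory) from the daemon's own resource."""
    pattern = r"^\s*" + re.escape(DAEMON_RESOURCE[kind]) + r"\s*\{([^}]*)\}"
    block = re.search(pattern, conf_text, re.IGNORECASE | re.MULTILINE)
    if block is None:
        raise ValueError("no %s resource found" % DAEMON_RESOURCE[kind])
    settings = {}
    for line in block.group(1).splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments (naive)
        if "=" in line:
            key, value = line.split("=", 1)
            # Normalize the keyword: ignore case and embedded whitespace.
            settings["".join(key.split()).lower()] = value.strip().strip('"')
    return settings.get("name"), settings.get("workingdirectory")


def find_name_collisions(configs):
    """Map each daemon name used more than once to its (kind, workdir) users."""
    by_name = defaultdict(list)
    for kind, text in configs.items():
        name, workdir = daemon_identity(kind, text)
        by_name[name].append((kind, workdir))
    return {name: users for name, users in by_name.items() if len(users) > 1}


# Hypothetical configs: the DIR and SD share a name and a working directory.
SAMPLE = {
    "dir": 'Director {\n  Name = backup1\n  WorkingDirectory = "/var/bacula"\n}\n',
    "sd": 'Storage {\n  Name = backup1\n  WorkingDirectory = "/var/bacula"\n}\n',
    "fd": 'FileDaemon {\n  Name = backup1-fd\n  WorkingDirectory = "/var/bacula/fd"\n}\n',
}

print(find_name_collisions(SAMPLE))
```

Any name it reports is shared by more than one daemon. If those daemons also share a working directory, that is the combination discussed in this thread.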

>
> Kern Sibbald wrote:
> > On Thursday 04 August 2005 19:42, Theron Toomey wrote:
> >>Hello,
> >>I initially thought this problem was due to corruption in my database.
> >>However, the behavior seems to be caused by the SD, FD, and DIR sharing
> >>a working directory. When I assign the FD a different working-dir from
> >>the DIR/SD (e.g. WorkingDirectory = "/var/bacula/fd"), my restores work
> >>perfectly.
> >
> > Yes, thanks for figuring this out.  The problem is that you did not give
> > your daemons unique names, as is recommended and as is the default. I'll
> > improve the documentation on this.
> >
> >>I have opened a bug on this here:
> >>http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000398
> >>
> >>GDB tracebacks, errors, conf files, and bconsole output demonstrating
> >>the problem:
> >>http://www.duke.edu/~ttoomey/misc/ttoomey-bacula-wd-dbg.20050803.tar.gz
> >>
> >>As a workaround, I have separated my FD working dir from the DIR/SD.
> >>Curiously, when I separate the DIR and SD working dirs (so each daemon
> >>has its own dir), my autochanger stops working. That's a different issue
> >>though, and one that I haven't had time to investigate.
> >>
> >>Thanks for all your help.
> >>
> >>Kern Sibbald wrote:
> >>>On Tuesday 26 July 2005 04:10, Theron Toomey wrote:
> >>>>Hi, thanks for the suggestions. Sorry it took me a few days to respond-
> >>>>there's not much time for testing between daily backup cycles.
> >>>>
> >>>>My current theory is that there is some strange corruption in my DB,
> >>>>perhaps in the File table.
> >>>>
> >>>>I'm not sure but I think this may be related to another problem I'm
> >>>>having. Restores (using option 3 or 5) of very large jobs (around 200
> >>>>GB) fail while writing the restore bootstrap. I suspect that while
> >>>>reading my catalog to generate the restore.bsr, bacula is encountering
> >>>>some corruption, which may also explain the strange garbage in my other
> >>>>restores.
> >>>
> >>>Could you send me the console and any output from this "fail" so I can
> >>>see what is going on.
> >>>
> >>>>This isn't necessarily pertinent but I have seen a couple interesting
> >>>>results with these large restores, varying from the SD segfaulting
> >>>>immediately to sitting in an infinite loop, eating about half the
> >>>>system memory, and then segfaulting (no, it's not using the tls lib).
> >>>>Here's a gdb trace of the latter behavior if you are curious:
> >>>>http://www.duke.edu/~ttoomey/misc/bacula-sd-debug.3.txt.gz
> >>>
> >>>Could you send me the bootstrap file from this?  When doing the restore
> >>>and it reaches the question yes/mod/no, it will have printed the
> >>> location of the bootstrap file just prior to issuing the prompt. Before
> >>> answering the prompt, you can copy it to another location (after
> >>> answering the prompt, it usually deletes the file).
> >>>
> >>>>After running dbcheck, it did cough up an error while restoring before
> >>>>the SD died:
> >>>>25-Jul 11:05 fury: restore.2005-07-25_11.03.07 Fatal error: Bootstrap
> >>>>file error: expected an integer or a range, got T_EOL: =
> >>>>  : Line 5543394, col 10 of file
> >>>>/var/bacula/fury.restore.2005-07-25_11.03.07.bootstrap
> >>>>FileIndex=
> >>>>25-Jul 11:05 fury: restore.2005-07-25_11.03.07 Fatal error: job.c:1662
> >>>>Comm error with SD. bad response to Bootstrap. ERR=No data available
> >>>>
> >>>>I plan on filing a bug about the SD issue after I do some more testing
> >>>>to try and isolate the problem. I think, whatever the corruption is, it
> >>>>should probably be handled more gracefully by the SD (if my theory is
> >>>>right).
> >>>
> >>>Yes, please do open a bug report -- preferably one for each problem
> >>>that you consider unrelated.  I am unable to adequately track and
> >>>resolve complicated problems such as this from emails.
> >>>
> >>>>Err.. anyway, please see below for my answers.
> >>>>
> >>>>Martin Simmons wrote:
> >>>>> Theron> Hello,
> >>>>> Theron> I'm seeing some strange behavior with restores under
> >>>>> Theron> 1.36.3/RHEL 3 using an AIT-3 drive. I'm not quite sure what is
> >>>>> Theron> causing it and I'd really appreciate any suggestions.
> >>>>>
> >>>>> Theron> When I choose restore option 5 (Select the most recent backup)
> >>>>> Theron> bacula proceeds to restore data from the last full and
> >>>>> Theron> subsequent diff/incr jobs. However, for large restores
> >>>>> Theron> (>50 GB), I notice a few dozen error messages like:
> >>>>> Theron>   Error: attribs.c:339 File size of restored file /foo/bar
> >>>>> Theron>   not correct. Original [file size], restored [large, bogus
> >>>>> Theron>   file size].
> >>>>>
> >>>>> Theron> Comparing the restores against the live data, I see that the
> >>>>> Theron> restored files have lots of random garbage inserted/appended
> >>>>> Theron> to them.
> >>>>>
> >>>>> Theron> However, when I manually find the jobIDs of the
> >>>>> Theron> full/diffs/incrs and restore them individually with restore
> >>>>> Theron> option 3, there is no corruption and the files all seem fine.
> >>>>>
> >>>>>Does "individually" mean one at a time, i.e. repeated use of option 3?
> >>>>>If so, do you get corruption if you enter all the jobIDs into a single
> >>>>>option 3 in the same order as bacula chose from option 5?
> >>>>
> >>>>Yes, individually means one at a time with repeated use of option 3. If
> >>>>I enter the same JobIDs from option 5 into option 3, I see exactly the
> >>>>same corruption on the same files as when I use option 5.
> >>>>
> >>>>> Theron> Most of the corrupt files are older than the last full;
> >>>>> Theron> Perhaps there's something in the diff/incr jobs that corrupts
> >>>>> Theron> the files from the full job. However, most of the corrupt
> >>>>> Theron> files are older than the last full and so are not even
> >>>>> Theron> present in the diff/incr jobs.
> >>>>>
> >>>>> Theron> Has anyone seen behavior like this or have any ideas about
> >>>>> Theron> where to look?
> >>>>>
> >>>>>For a particular restore, is it always the same files that are
> >>>>>corrupted? If yes, is the garbage really random or is it the same
> >>>>>garbage each time?  Also, what happens if you use option 5 but only
> >>>>>mark one of the corrupted files for restore?
> >>>>
> >>>>If I perform two identical restores, I see the same files corrupted
> >>>>with the same garbage. The md5sums of the corresponding files from
> >>>>each restore are a match so the garbage isn't random or at least it's
> >>>>dependent on something else.
> >>>>
> >>>>If I mark just one of the corrupted files for restore, there is no
> >>>>corruption in the file.
> >>>
> >>>Can you send me the corresponding bootstrap files for these two cases so
> >>>that I can compare them.
> >>>
> >>>What database are you using, and can you give me an idea how big it is?
> >>>
> >>>>Thanks for your help, I hope I'm on the right track.
> >>>
> >>>Well, you are certainly doing the right things, but it is a bit early to
> >>>tell what the right track really is ...
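
The "expected an integer or a range, got T_EOL" error quoted above points at a bootstrap line that has a keyword but no value ("FileIndex="). Here is a minimal Python sketch, not a Bacula tool, that scans a saved copy of a bootstrap file for such lines. It only looks for empty values and does not validate the rest of the bootstrap grammar. The function name and the sample contents are made up for illustration.

```python
def find_empty_values(lines):
    """Yield (line_number, keyword) for 'keyword=' lines with no value."""
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, sep, value = line.partition("=")
        if sep and not value.strip():
            yield number, key.strip()


# Hypothetical bootstrap contents; the last record is truncated.
sample = [
    'Volume="ExampleVol-0001"',
    "VolSessionId=1",
    "VolSessionTime=1122300000",
    "VolFile=0-5",
    "FileIndex=1-100",
    "FileIndex=",
]

print(list(find_empty_values(sample)))  # [(6, 'FileIndex')]
```

Because the function takes any iterable of lines, an open file object can be passed in directly, so a bootstrap file with millions of lines is read one line at a time.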

-- 
Best regards,

Kern

  (">
  /\
  V_V


