On Friday 05 August 2005 21:29, Theron Toomey wrote:
> Thanks Kern,
> Changing the names of the daemons solved the problem. Given the strange
> behavior, it wouldn't have occurred to me that was the cause but it
> makes perfect sense.

Thanks for the feedback. It is always nice to have a confirmation. Yes, the
downside of having the same names isn't so obvious. I've updated the doc to
have an explicit warning about it.

>
> Kern Sibbald wrote:
> > On Thursday 04 August 2005 19:42, Theron Toomey wrote:
> >>Hello,
> >>I initially thought this problem was due to corruption in my database.
> >>However, the behavior seems to be caused by the SD, FD, and DIR sharing
> >>a working directory. When I assign the FD a different working-dir from
> >>the DIR/SD (e.g. WorkingDirectory = "/var/bacula/fd"), my restores work
> >>perfectly.
> >
> > Yes, thanks for figuring this out. The problem is that you did not, as
> > recommended and as is the default, give your daemons unique names. I'll
> > improve the documentation on this.
> >
> >>I have opened a bug on this here:
> >>http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000398
> >>
> >>GDB tracebacks, errors, conf files, and bconsole output demonstrating
> >>the problem:
> >>http://www.duke.edu/~ttoomey/misc/ttoomey-bacula-wd-dbg.20050803.tar.gz
> >>
> >>As a workaround, I have separated my FD working dir from the DIR/SD.
> >>Curiously, when I separate the DIR and SD working dirs (so each daemon
> >>has its own dir), my autochanger stops working. That's a different issue
> >>though, and one that I haven't had time to investigate.
> >>
> >>Thanks for all your help.
> >>
> >>Kern Sibbald wrote:
> >>>On Tuesday 26 July 2005 04:10, Theron Toomey wrote:
> >>>>Hi, thanks for the suggestions. Sorry it took me a few days to respond-
> >>>>there's not much time for testing between daily backup cycles.
> >>>>
> >>>>My current theory is that there is some strange corruption in my DB,
> >>>>perhaps in the File table.
> >>>>
> >>>>I'm not sure but I think this may be related to another problem I'm
> >>>>having. Restores (using option 3 or 5) of very large jobs (around 200
> >>>>GB) fail while writing the restore bootstrap.
> >>>>I suspect that while reading my catalog to generate the restore.bsr,
> >>>>bacula is encountering some corruption, which may also explain the
> >>>>strange garbage in my other restores.
> >>>
> >>>Could you send me the console and any output from this "fail" so I can
> >>>see what is going on?
> >>>
> >>>>This isn't necessarily pertinent but I have seen a couple interesting
> >>>>results with these large restores, varying from the SD segfaulting
> >>>>immediately to sitting in an infinite loop, eating about half the
> >>>>system memory, and then segfaulting (no, it's not using the tls lib).
> >>>>Here's a gdb trace of the latter behavior if you are curious:
> >>>>http://www.duke.edu/~ttoomey/misc/bacula-sd-debug.3.txt.gz
> >>>
> >>>Could you send me the bootstrap file from this? When doing the restore
> >>>and it reaches the question yes/mod/no, it will have printed the
> >>>location of the bootstrap file just prior to issuing the prompt. Before
> >>>answering the prompt, you can copy it to another location (after
> >>>answering the prompt, it usually deletes the file).
> >>>
> >>>>After running dbcheck, it did cough up an error while restoring before
> >>>>the SD died:
> >>>>25-Jul 11:05 fury: restore.2005-07-25_11.03.07 Fatal error: Bootstrap
> >>>>file error: expected an integer or a range, got T_EOL: =
> >>>>
> >>>> : Line 5543394, col 10 of file
> >>>>
> >>>>/var/bacula/fury.restore.2005-07-25_11.03.07.bootstrap
> >>>>FileIndex=
> >>>>25-Jul 11:05 fury: restore.2005-07-25_11.03.07 Fatal error: job.c:1662
> >>>>Comm error with SD. bad response to Bootstrap. ERR=No data available
> >>>>
> >>>>I plan on filing a bug about the SD issue after I do some more testing
> >>>>to try and isolate the problem. I think, whatever the corruption is, it
> >>>>should probably be handled more gracefully by the SD (if my theory is
> >>>>right).
> >>>
> >>>Yes, please do open a bug report -- preferably one for each problem
> >>>that you consider unrelated.
> >>>I am unable to adequately track and resolve complicated problems such
> >>>as this from emails.
> >>>
> >>>>Err.. anyway, please see below for my answers.
> >>>>
> >>>>Martin Simmons wrote:
> >>>>> Theron> Hello,
> >>>>> Theron> I'm seeing some strange behavior with restores under
> >>>>> Theron> 1.36.3/RHEL 3 using an AIT-3 drive. I'm not quite sure what
> >>>>> Theron> is causing it and I'd really appreciate any suggestions.
> >>>>>
> >>>>> Theron> When I choose restore option 5 (Select the most recent
> >>>>> Theron> backup) bacula proceeds to restore data from the last full
> >>>>> Theron> and subsequent diff/incr jobs. However, for large restores
> >>>>> Theron> (>50 GB), I notice a few dozen error messages like:
> >>>>> Theron> Error: attribs.c:339 File size of restored file /foo/bar
> >>>>> Theron> not correct. Original [file size], restored [large, bogus
> >>>>> Theron> file size].
> >>>>>
> >>>>> Theron> Comparing the restores against the live data, I see that the
> >>>>> Theron> restored files have lots of random garbage inserted/appended
> >>>>> Theron> to them.
> >>>>>
> >>>>> Theron> However, when I manually find the jobIDs of the
> >>>>> Theron> full/diffs/incrs and restore them individually with restore
> >>>>> Theron> option 3, there is no corruption and the files all seem fine.
> >>>>>
> >>>>>Does "individually" mean one at a time, i.e. repeated use of option 3?
> >>>>>If so, do you get corruption if you enter all the jobIDs into a single
> >>>>>option 3 in the same order as bacula chose from option 5?
> >>>>
> >>>>Yes, individually means one at a time with repeated use of option 3. If
> >>>>I enter the same JobID's from option 5 into option 3, I see exactly the
> >>>>same corruption on the same files as when I use option 5.
> >>>>
> >>>>> Theron> Most of the corrupt files are older than the last full;
> >>>>> Theron> Perhaps there's something in the diff/incr jobs that corrupts
> >>>>> Theron> the files from the full job.
> >>>>> Theron> However, most of the corrupt files are older than the last
> >>>>> Theron> full and so are not even present in the diff/incr jobs.
> >>>>>
> >>>>> Theron> Has anyone seen behavior like this or have any ideas about
> >>>>> Theron> where to look?
> >>>>>
> >>>>>For a particular restore, is it always the same files that are
> >>>>>corrupted? If yes, is the garbage really random or is it the same
> >>>>>garbage each time? Also, what happens if you use option 5 but only
> >>>>>mark one of the corrupted files for restore?
> >>>>
> >>>>If I perform two identical restores, I see the same files corrupted
> >>>>with the same garbage. The md5sums of the corresponding files from
> >>>>each restore are a match so the garbage isn't random or at least it's
> >>>>dependent on something else.
> >>>>
> >>>>If I mark just one of the corrupted files for restore, there is no
> >>>>corruption in the file.
> >>>
> >>>Can you send me the corresponding bootstrap files for these two cases so
> >>>that I can compare them?
> >>>
> >>>What database are you using, and can you give me an idea how big it is?
> >>>
> >>>>Thanks for your help, I hope I'm on the right track.
> >>>
> >>>Well, you are certainly doing the right things, but it is a bit early to
> >>>tell what the right track really is ...

--
Best regards,

Kern

  (">
  /\
  V_V

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
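
Note for readers of the archive: the fix confirmed at the top of this thread
was to give each daemon its own name. A minimal sketch of what that might
look like follows. The names fury-dir, fury-sd and fury-fd are made up for
illustration; the thread only shows the name "fury" in the log output and in
the bootstrap path, and the -dir/-sd/-fd suffixes are an assumption, not
something stated in the thread. The bootstrap path quoted in the thread,
/var/bacula/fury.restore.2005-07-25_11.03.07.bootstrap, suggests that files
in the working directory are prefixed with the daemon name, so identically
named daemons sharing one directory could presumably overwrite each other's
files. That reading is an inference, not something spelled out in the thread.

```
# bacula-dir.conf (sketch; all other directives left as they are)
Director {
  Name = fury-dir                    # hypothetical name, unique to the DIR
  WorkingDirectory = "/var/bacula"
}

# bacula-sd.conf
Storage {
  Name = fury-sd                     # hypothetical name, unique to the SD
  WorkingDirectory = "/var/bacula"
}

# bacula-fd.conf
FileDaemon {
  Name = fury-fd                     # hypothetical name, unique to the FD
  WorkingDirectory = "/var/bacula"
}
```

If the Director is renamed, the Director resources in the SD and FD
configuration files (and in the console configuration) most likely need the
new name as well, since they refer to the Director by name. Giving the FD
its own WorkingDirectory, as described in the thread, is the alternative
workaround.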