Doytchin Spiridonov wrote:
> Hello,
> 
> done. Found where the problem is after some more tests (and once again
> it is not in our hardware or OS or broken things). It is where I
> initially suggested - the concurrent jobs.

So you can reliably reproduce the problem now?  Excellent!

> After the first (and native configuration) we used (concurrent jobs,
> with gzip) we tested the following:
> 
> 1. concurrent jobs, w/o gzip
> - we got similar errors (1 wrong filesize from 4 jobs, but 3 of 4 jobs
> with fewer files than expected; the 4th is usually very small - 100
> files - and never had errors, so I would say 100% of jobs were invalid)
> 
> 2. no concurrent jobs (Maximum Concurrent Jobs = 1 at dir and sd), w/o
> gzip
> - good news, all restores are OK, no errors, Files Expected and Files
> Restored match!
> 
> 3. no concurrent jobs WITH gzip
> - again OK, all restores are OK, no errors, Files Expected and Files
> Restored match!

Okay, so it looks like you can reproduce the symptoms just with multiple
concurrent jobs, regardless of the gzip settings.

> So until now we have:
> - the problem is not caused by a corrupted file system
> - volumes are consistent and bls doesn't show errors
> - MySQL is OK (initially 4.1.x now 5.0.37)
> - when running concurrent jobs both 2.0.3 and 2.1.28 say backups are
> OK but restores fail with one of the 3 kinds of errors listed below
> - when concurrent jobs are turned off everything is OK
> - gzip on/off doesn't affect the errors

I realize that you mentioned in another email you're dumping the mysql tables
nightly, but I would still strongly recommend that you run a repair tables on
your catalog to be absolutely sure there isn't any subtle corruption that's
snuck in.  It pays to be painfully methodical when troubleshooting this kind
of scenario, especially since you seem to be the first to knowingly run into
this problem.

Another good thing to try would be to double check and make sure that your
catalog schema exactly matches what bacula is expecting.  If, for example, the
column type holding volume offsets somehow became a 16 bit int where bacula
was expecting a 32 bit, the inserted values could become truncated or wrap
around, causing the kind of corruption you're seeing.
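Just to illustrate the kind of silent damage a mismatched column type can
do (this is not a claim about what's actually in your catalog): the block
address from your error 3 below survives an unsigned 32-bit field, but goes
negative reinterpreted as signed, and is truncated outright in 16 bits.

```python
import struct

# 3999743252 is the block address from the "Volume data error" message.
addr = 3999743252

# Round-trips through an unsigned 32-bit field unchanged:
u32 = struct.unpack('<I', struct.pack('<I', addr))[0]

# Reinterpreted as a signed 32-bit int, it goes negative:
s32 = struct.unpack('<i', struct.pack('<I', addr))[0]

# Forced into a 16-bit column, it is silently truncated:
u16 = addr & 0xFFFF

print(u32, s32, u16)  # 3999743252 -295224044 15636
```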

Actually, that gives me another idea.  While I've never used it myself, you
may be able to get more details by running some jobs with strict mode turned
on for your mysql catalog.

http://dev.mysql.com/doc/refman/5.0/en/server-sql-mode.html

If your bacula installation is doing something that would cause the data
stored to be wrong, such as storing a value that doesn't fit in the column
type, I believe this should turn it from a silent warning into a fatal error,
making it easier to track down.
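If memory serves, enabling it is just a matter of something like this
(double check against the 5.0 manual, since I haven't run strict mode on a
production catalog myself):

```
# /etc/my.cnf -- make out-of-range or truncated inserts fatal
# for all storage engines, not just transactional ones:
[mysqld]
sql-mode = "STRICT_ALL_TABLES"
```

You can also flip it at runtime with SET GLOBAL sql_mode =
'STRICT_ALL_TABLES'; note that only takes effect for new connections, so
restart the director afterwards.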

Also, it's been suggested that you try turning on spooling.  Have you done so?
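(For the archives: that means the SpoolData directive in the Job resource
on the director, plus a spool directory on the SD side.  The names below
are from memory and the Job name is made up, so verify against the manual
for your version:)

```
# bacula-dir.conf -- in each Job (or JobDefs) resource:
Job {
  Name = "b0-backup"        # hypothetical name
  SpoolData = yes
  # ... rest of the job definition unchanged ...
}

# bacula-sd.conf -- in the Device resource:
Device {
  Spool Directory = /var/spool/bacula
  Maximum Spool Size = 10gb
  # ... rest of the device definition unchanged ...
}
```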

> Once again the 3 types of errors are:
> 
> 1. some static files (i.e. not log files!) are restored with the wrong
> (always larger) size, while the first N bytes match and the rest is
> filled with part of another file (not sure if this is just a file with
> a wrong size where some old data on the disk appears at the end, or
> bacula restores part of another file and appends it to the end). The
> file can be restored correctly if marked alone, but error 3. below
> is generated (which seems to be just a bogus error). An example error is:
> ---
> b0: Restore_b0.d6.int.2007-07-23_22.37.34 Error: attribs.c:410 File size
> of restored file
> /home/bacula/res/b3.2/usr/src/redhat/RPMS/i686/glibc-2.2.5-44.i686.rpm
> not correct. Original 3826291, restored 10620921.
> ---
> When this error is present, the second error below (missing files, but
> w/o the additional error messages) is always present as well
> 
> 2. a large number of files are missing (while they are present in the
> catalog and selected) - tens of thousands (not sockets or anything
> else that Bacula ignores by default). When this happens, usually an
> error like this appears (if not the first one above):
> ---
> b3: Restore_b3.d6.int.2007-07-23_17.31.47 Fatal error: Record header
> file index 42452 not equal record index 0
> Storage: Restore_b3.d6.int.2007-07-23_17.31.47 Fatal error: read.c:124
> Error sending to File daemon. ERR=Connection reset by peer
> Storage: Restore_b3.d6.int.2007-07-23_17.31.47 Error: bsock.c:306
> Write error sending 30 bytes to client:10.2.1.13:36643: ERR=Connection
> reset by peer
> ---
> 
> 3. when a file from error 1 is restored alone it is OK, but another
> bogus error is generated:
> ---
> Storage: Restore_b0.d6.int.2007-07-23_22.57.42 Error: block.c:275
> Volume data error at 0:3999743252! Wanted ID: "BB02", got "Иnлу".
> Buffer discarded.
> ---
> Found that the above number (3999743252) is not present as block
> address for any block in the volumes, but the same number appears as
> part of JobMedia record in the database.
> 
> 
> This is everything in 2.1.28 summarized that popped up as a problem or
> fact.
> (2.0.3 had another bug with bogus errors about sockets' attributes and
> 2.1.26 had bogus SQL error messages, but those are fixed in
> 2.1.28.)
> 
> If anyone wants, feel free to reopen the bug in Mantis (903). I'm not
> going to do so as I am personally disappointed by the attitude "this
> is not a bug - work it out yourself" and the suggestion to send you
> our servers as a gift to test with, plus support fees... nice. Now
> it's up to you to create better test cases to catch more bugs if any.

Be fair, now - no one has suggested that there isn't a problem here, merely
that without enough information to reliably reproduce the problem it's
unrealistic to expect that it's going to be fixed.  If no one else is having
the problem, it's very difficult to find out what the glitch is and how to fix
it.  Believe me, I've been on both sides of these kinds of glitches, and I
understand what you're feeling, but it doesn't help any.

Just take a look at what I went through with #888, and the email thread
'spurious connection drops killing backups'.  No one else was seeing my
problem, and it took me a month to get enough data to find a fix for the
symptoms I was seeing.

> We will start our backup again w/o concurrent jobs and we will
> continue to monitor restores on a daily basis, as the above tests are
> just 3 and I agree there is a possibility that it was just chance
> that the latter two tests went OK. But it was my suggestion from the
> beginning that the problem is Bacula damaging either database numbers
> or volume records when concurrent jobs are running, and so far the
> facts have proved this.

So far the facts suggest this, but suggesting is a long way from proving, and
even further from finding a real fix.

If you'd like further help with a real fix, the best thing to do would be to
extract out just enough of your config for someone else to reproduce the
problem.  That way the developers can pull out debug info, experiment with
fixes, etc, without making you try everything they need done.  One way to do
that might also be to write up a test case to go in the regression tree, to
guarantee that this problem doesn't creep back in later on.

> (!) The workaround for the problem is to switch off concurrent jobs;
> if not, the chance that you have invalid backups is high (some 90% in
> our own cases, at least with our servers/OS/configuration; and that is
> without counting that 100% of backups are arguably wrong, since after
> diff/incremental backups Bacula restores files that were deleted,
> which is really bad behaviour for many cases/services).

Obviously that's not a very good workaround in the long run, especially for
those of us with multiple drives.

-- 
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Network Engineer          |  is simple, elegant, and wrong. - HL Mencken
    GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
