Hello,

Tuesday, July 24, 2007, 2:00:43 PM:

FS> Okay, so it looks like you can reproduce the symptoms just with multiple
FS> concurrent jobs, regardless of the gzip settings.

I am sure the files/dirs being backed up matter! I bet the developers have
tested concurrent jobs well enough, but if they didn't catch the problem,
then the directory structure / number of files / data size must be
important. To give a picture of our test environment - I am testing with 4
jobs, 4 separate servers, 50K - 350K files each and 2-7GB of data. One of
the jobs is backing up the Bacula server itself (not sure if this matters;
I noted a possible problem with giving the daemons identical names and thus
having temp files overwritten, but that is not our case).

>> So until now we have:
>> - the problem is not caused by a corrupted file system
>> - volumes are consistent and bls doesn't show errors
>> - MySQL is OK (initially 4.1.x now 5.0.37)
>> - when running concurrent jobs both 2.0.3 and 2.1.28 say backups are
>> OK but restores fail with one of the 3 kinds of errors listed below
>> - when concurrent jobs are turned off everything is OK
>> - gzip on/off doesn't affect the errors

FS> I realize that you mentioned in another email you're dumping the mysql
FS> tables nightly, but I would still strongly recommend that you run a
FS> repair tables on your catalog to be absolutely sure there isn't any
FS> subtle corruption that's snuck in.  It pays to be painfully methodical
FS> when troubleshooting this kind of scenario, especially since you seem to
FS> be the first to knowingly run into this problem.
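
(For the record, I am assuming here that the catalog database is simply
named "bacula" - adjust to your setup. The check/repair can be done with
something like:

    mysqlcheck --check --databases bacula
    mysqlcheck --repair --databases bacula

or from the mysql client with CHECK TABLE / REPAIR TABLE on the individual
tables that report problems.)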

FS> Another good thing to try would be to double check and make sure that
FS> your catalog schema exactly matches what bacula is expecting.  If, for
FS> example, the column type holding volume offsets somehow became a 16 bit
FS> int where bacula was expecting a 32 bit, the inserted values could
FS> become truncated or wrap around, causing the kind of corruption you're
FS> seeing.
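
(If I understand this correctly, it means comparing the live schema, e.g.

    mysql bacula -e "SHOW CREATE TABLE JobMedia"
    mysql bacula -e "SHOW CREATE TABLE File"

against the make_mysql_tables script that ships with the Bacula version in
use - again assuming the catalog database is named "bacula".)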

FS> Actually, that gives me another idea.  While I've never used it myself, you
FS> may be able to get more details by running some jobs with strict mode turned
FS> on on your mysql catalog.

FS> http://dev.mysql.com/doc/refman/5.0/en/server-sql-mode.html

FS> If your bacula installation is doing something that would cause the data
FS> stored to be wrong, such as storing a value that doesn't fit in the
FS> column type, I believe this should turn it from a silent warning into a
FS> fatal error, making it easier to track down.
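
(If I read that manual page correctly, strict mode can be enabled either at
runtime with

    SET GLOBAL sql_mode = 'STRICT_ALL_TABLES';

or in my.cnf under [mysqld] with sql-mode="STRICT_ALL_TABLES". If the
catalog tables are non-transactional (MyISAM), I assume STRICT_ALL_TABLES
rather than STRICT_TRANS_TABLES is needed for the bad inserts to actually
fail.)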

FS> Also, it's been suggested that you try turning on spooling.  Have you
FS> done so?

Nice suggestion. I will try it, and spooling as well. This should cut the
possibilities roughly in half, as the problem is either wrong data in the
database or wrong data in the volumes (or both).

Re the MySQL check - as we "fixed" the problem yesterday I don't have a DB
now to check against, but I'll start a new backup the old way to reproduce
the problem and then verify the DB, just to rule out that possibility. (We
have one spare server for Bacula tests which has no Xen or LVM at all, and
we were getting the problem there as well, which also shows it is not Xen
or LVM related; I forgot to mention this yesterday.)

Done, and more info: to my surprise, this time only one of the 4 jobs had a
problem, and strangely at a similar place (I recall the filename was broken
once before - it's from the same dir). Anyway, the file size was different
and it was an error of type 1.

Checked the bacula tables - no problems, all had status OK.

BUT, I see that on this server it happened in 1 out of 4 jobs, while on the
other it was 4 out of 4 (which was much better for testing). So if I enable
spooling and get no errors, that doesn't necessarily mean spooling solved
the problem - it could just be luck. As you see, it doesn't happen every
time nor for all jobs. But I will run several more tests anyway.

Now running the same with spooling. My first impression is that for 4 jobs
it is writing to 8 different files. This is not so good for performance;
wouldn't it be the same to define a different pool for every job? If
spooling fixes the problem (i.e. a separate write stream for every job),
that would mean a separate pool per job should do the same, while saving
the time spent transferring data between the spool files and the volume
(rough sketch of what I mean below).
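
Something like this is what I have in mind - an untested sketch, the
resource names are made up, and there would be one such Pool/Job pair per
client:

    Pool {
      Name = "server1-pool"          # one pool, i.e. its own volumes, per job
      Pool Type = Backup
      Label Format = "server1-"
    }
    Job {
      Name = "server1-backup"
      Type = Backup
      Client = server1-fd
      FileSet = "server1-files"
      Storage = File
      Pool = "server1-pool"          # instead of one shared pool for all jobs
      Spool Data = yes               # the spooling variant I am testing now;
                                     # could be dropped if separate pools alone
                                     # turn out to be enough
      Messages = Standard
    }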

>> (!) The workaround for the problem is to switch off concurrent
>> jobs...

FS> Obviously that's not a very good workaround in the long run, especially for
FS> those of us with multiple drives.

This is why I also asked earlier yesterday about a comparison with/without
concurrent jobs, or with writing to separate volumes: I was sure we would
end up with concurrent jobs disabled, but as Wolfgang has shared, that is
slower.

Regards.

