Hello,
Tuesday, July 24, 2007, 2:00:43 PM:

FS> Okay, so it looks like you can reproduce the symptoms just with multiple
FS> concurrent jobs, regardless of the gzip settings.

I am sure the files/dirs being backed up matter! I expect the developers
have tested with enough concurrent jobs, so if they didn't catch the
problem, then the directory structure, number of files, and sizes must be
important. To give a picture of our test environment: I am testing with 4
jobs on 4 separate servers, 50K - 350K files each and 2-7GB of data. One
of the jobs backs up the Bacula server itself (not sure if this matters;
I noted a possible problem when the daemons are given the same names,
causing temp files to be overwritten, but that is not our case).

>> So until now we have:
>> - the problem is not caused by a corrupted file system
>> - volumes are consistent and bls doesn't show errors
>> - MySQL is OK (initially 4.1.x, now 5.0.37)
>> - when running concurrent jobs both 2.0.3 and 2.1.28 say backups are
>> OK but restores fail with one of the 3 kinds of errors listed below
>> - when concurrent jobs are turned off everything is OK
>> - gzip on/off doesn't affect the errors

FS> I realize that you mentioned in another email you're dumping the mysql tables
FS> nightly, but I would still strongly recommend that you run a repair tables on
FS> your catalog to be absolutely sure there isn't any subtle corruption that's
FS> snuck in. It pays to be painfully methodical when troubleshooting this kind
FS> of scenario, especially since you seem to be the first to knowingly run into
FS> this problem.

FS> Another good thing to try would be to double check and make sure that your
FS> catalog schema exactly matches what bacula is expecting. If, for example, the
FS> column type holding volume offsets somehow became a 16 bit int where bacula
FS> was expecting a 32 bit, the inserted values could become truncated or wrap
FS> around, causing the kind of corruption you're seeing.

FS> Actually, that gives me another idea. While I've never used it myself, you
FS> may be able to get more details by running some jobs with strict mode turned
FS> on on your mysql catalog.
FS> http://dev.mysql.com/doc/refman/5.0/en/server-sql-mode.html
FS> If your bacula installation is doing something that would cause the data
FS> stored to be wrong, such as storing a value that doesn't fit in the column
FS> type, I believe this should turn it from a silent warning into a fatal error,
FS> making it easier to track down.

FS> Also, it's been suggested that you try turning on spooling. Have you done so?

Nice suggestion. I will try it, and spooling as well. This should cut the
possibilities roughly in half, as the problem is either wrong data in the
database or wrong data in the volumes (or both). Re the MySQL check: since
we "fixed" the problem yesterday I don't have a database to check against
right now, but I will start a new backup the old way to reproduce the
problem and then verify the DB, just to rule that possibility out. (We
have one spare server for Bacula tests which runs neither Xen nor LVM, and
we were getting the problem there as well, which also shows it is not Xen
or LVM related; I forgot to mention this yesterday.)

Done, and more info: to my surprise, this time only one of the 4 jobs had
a problem, and strangely at a similar place (I recall that filename being
broken once before - it's from the same dir). Anyway, the file size was
different (error type 1). I checked the Bacula tables - no problems, all
had status OK.
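For reference, this is roughly how I plan to verify the catalog along the
lines FS suggested, next time the problem shows up (only a sketch; I am
assuming here that the catalog database and the MySQL user are both named
"bacula", adjust for your own setup):

    # repair the catalog tables to rule out subtle MyISAM corruption
    mysqlcheck --repair --user=bacula --password bacula

    # dump the live schema of the table holding the volume offsets and
    # compare it by hand with the make_mysql_tables script from the sources
    mysql --user=bacula --password bacula -e "SHOW CREATE TABLE JobMedia\G"

    # turn on strict mode so bad inserts become errors instead of warnings
    # (MySQL 5.0, needs SUPER privilege, affects new connections only)
    mysql --user=root --password -e "SET GLOBAL sql_mode = 'STRICT_ALL_TABLES'"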
BUT, I see that on this server it happened in 1 out of 4 jobs, while on
the other it was 4 out of 4 (which was much better for testing). So if I
enable spooling and get no errors, that would not necessarily mean
spooling solved the problem; it could just be luck. As you can see, it
doesn't happen every time, nor for all jobs. But I will run several more
tests anyway.

Now running the same with spooling. My first impression: I noticed that
for 4 jobs it is writing to 8 different files. This is not so good for
performance, and wouldn't it amount to the same thing as defining a
different pool for every job? If spooling fixes the problem (i.e. a
separate write stream for every job), that would mean a separate pool per
job should do the same, while saving the time spent transferring data
between the spool files and the volumes?

>> (!) The workaround for the problem is to switch off concurrent
>> jobs...

FS> Obviously that's not a very good workaround in the long run, especially for
FS> those of us with multiple drives.

This is why I also asked earlier yesterday about a comparison with/without
concurrent jobs, or writing to separate volumes: I was sure we would end
up with concurrent jobs disabled, but as Wolfgang reports, that is slower.

Regards.
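P.S. To make the spooling vs. per-job-pool comparison concrete, this is
the kind of bacula-dir.conf change I mean - only a rough sketch, the
resource names are made up and the rest of the Job/Pool definitions are
omitted:

    # Variant 1: keep the shared pool, but let each job spool its data on
    # the SD first (the spool area itself is set with Spool Directory and
    # Maximum Spool Size in the SD's Device resource)
    Job {
      Name = "server1-backup"
      Spool Data = yes
      # Client, FileSet, Schedule, Storage, Pool as before
    }

    # Variant 2: a dedicated pool per job, so each job writes straight to
    # its own volumes with no despooling step
    Pool {
      Name = "server1-pool"
      Pool Type = Backup
      Label Format = "server1-"
    }
    Job {
      Name = "server1-backup"
      Pool = "server1-pool"
      # Client, FileSet, Schedule, Storage as before
    }

If variant 2 really avoids the interleaving the same way spooling does, it
would also save the time spent despooling, which is what I was asking
about above.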