Hello,

I forgot to mention something very IMPORTANT: I discovered that in
*all* such cases (restored files with a larger size), if we do not
perform a full restore but restore a SINGLE file, it is restored OK
with the *correct* size and content. It is also OK if we restore just
the directory it is in (together with the other files in it).

Which proves it is not a problem with the FS, kernel, Xen, LVM,
hardware, etc., but a problem with Bacula.
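
For reference, here is roughly what we mean by checking for "correct
size and content" (a minimal Python sketch; the file paths are just
examples): it compares sizes, checks that the restored file begins with
the exact content of the original, and reports any extra appended bytes.

  # Minimal sketch: compare a restored file against its original copy.
  # Verifies the size, checks that the restored file begins with the
  # original's exact content, and reports any extra appended bytes.
  import os
  import sys

  def compare(original, restored, bufsize=1 << 20):
      orig_size = os.path.getsize(original)
      rest_size = os.path.getsize(restored)
      with open(original, "rb") as fo, open(restored, "rb") as fr:
          remaining = orig_size
          while remaining > 0:
              n = min(bufsize, remaining)
              if fo.read(n) != fr.read(n):
                  return "content differs within the original length"
              remaining -= n
      if rest_size == orig_size:
          return "OK: size and content match"
      return "prefix matches, %d extra bytes appended" % (rest_size - orig_size)

  if __name__ == "__main__":
      print(compare(sys.argv[1], sys.argv[2]))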

Regards


Monday, July 23, 2007, 9:57:40 PM:

DS> Hello,

DS> I've filed this as a bug, but since Kern couldn't reproduce it, he gave
DS> up. So let us try to find out here what the problem could be. There are
DS> actually two problems, and they could be linked.

DS> Here is the history:
DS> Initially we were using 2.0.3. After running backups for several weeks I
DS> wanted to restore a file and was surprised that I couldn't restore it. It
DS> was listed in the catalog, I could select it and run a restore job,
DS> but the file didn't come up. Investigating what happened, I ran a full
DS> restore job and was surprised that in that directory (where the file
DS> is) several files were missing. Error messages similar to the one in
DS> my first post here were also present. In addition, there was a big
DS> difference between marked files and actually restored files (certainly
DS> not hard links, sockets or anything else that is ignored by Bacula -
DS> in one of the tests the whole /home/ directory was missing).
DS> After that we started with tests (backup full/diff/inc, restore etc.)
DS> for a week. Every time (but at random places/files) a similar error
DS> happened. Sometimes there are errors, sometimes not. We haven't run
DS> enough tests yet to work out when exactly this happens. But IT
DS> HAPPENS, and as a result we don't have a reliable backup. I know a lot
DS> of people run backups without testing restores, and that's why (if this
DS> is not related to our specific setup) these problems may only show up
DS> when they have an emergency, which luckily doesn't happen often. Anyway,
DS> here are the hardware and setup details:

DS> *** Bacula: 2.1.28 on all servers.
DS> As of yesterday we cleaned everything (Bacula DB and volumes) and
DS> installed everywhere the latest beta, *2.1.28* (note this is not a
DS> problem of the beta, as we already saw it with 2.0.3). 2.1.28 fixed
DS> 2 other problems we discovered with 2.0.3, but this one is still
DS> there.
DS> The Director and most of the servers are 64-bit; two of the servers are
DS> 32-bit.
DS> *** OS: Linux CentOS 4.5
DS> *** MySQL: 5.0.37
DS> *** Servers (all are almost identical): Supermicro, PDSME - Intel
DS> E7230 (Mukilteo) chipset, Intel Pentium D 930 Dual Core 3.0GHz, 3Ware
DS> IDE RAID Controller Escalade 9550SX. Servers have 4 disks each in RAID
DS> 1+0; only the Bacula server has many disks in RAID 5.
DS> *** Some servers are plain CentOS, some have Xen with virtual servers;
DS> the Bacula server itself also has Xen, but Bacula is running in
DS> Dom0 and no other virtual machines are running on it at this time.
DS> *** The servers with Xen also have LVM.
DS> *** We run concurrent jobs (and I guess this is where the Bacula
DS> problem lies).
DS> *** GZIP compression is enabled.
DS> *** We save volumes on hard disk; their size is set to 4480MB.

DS> --- How to get an error:
DS> Since we initially discovered the error after several weeks of backups,
DS> we guessed that it could have been caused by us setting a wrong
DS> Volume Retention or some other retention time, so that some files
DS> had been purged.

DS> We started everything from zero again, and after 3 days (it happened
DS> that the first backup was a Full, the next a Differential and the last
DS> an Incremental) we performed a test and that error happened again! So we
DS> were sure it is not caused by an accidental purge of some files.

DS> After that we could get the error even after just a full backup,
DS> trying to restore immediately after it finished.

DS> Yesterday we cleaned everything again and compiled (from SRPMs) the
DS> latest 2.1.28.

DS> We ran a full backup again (again with all jobs concurrent), and the
DS> errors described here happen when we try to restore files from every job
DS> (except one that has just 150 files).

DS> So there are two problems:
DS> - sometimes some files are restored with a larger size, while the first
DS> part of the file matches the original exactly (these are not log files
DS> or dynamic files); this happens in very rare cases (~one case per 5 jobs)
DS> - sometimes not all files are restored and tens of thousands are
DS> missing, for example:
DS>   Files Expected:         190,718
DS>   Files Restored:         166,097
DS> This happens more often (~one case per 2 jobs).
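DS>
DS> (To see exactly which files are missing we simply compare the file
DS> lists of the original and the restored tree; a rough Python sketch,
DS> with the two root paths passed on the command line as examples:)
DS>
DS>   # Rough sketch: report files that are present under the original tree
DS>   # but missing from the restored tree. The root paths are examples only.
DS>   import os
DS>   import sys
DS>
DS>   def walk(root):
DS>       for dirpath, dirnames, filenames in os.walk(root):
DS>           for name in filenames:
DS>               yield os.path.relpath(os.path.join(dirpath, name), root)
DS>
DS>   original_root, restored_root = sys.argv[1], sys.argv[2]
DS>   missing = sorted(set(walk(original_root)) - set(walk(restored_root)))
DS>   print("missing files:", len(missing))
DS>   for path in missing:
DS>       print(path)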

DS> Note that once the error happens we can reproduce it on every restore:
DS> at the same place, for the same file, and with the same number of
DS> missing files (i.e. this is not a problem of the restore itself; it is
DS> most likely a problem with the volumes).

DS> Our future tests are:
DS> 1. we will do the same (concurrent jobs) but without using GZIP
DS> 2. if it happens again we will set max concurrent jobs to 1 so every job
DS> runs alone, because AFAIR we didn't get errors when we ran just one
DS> full backup job; it always happens when we run several at once (but I
DS> am not 100% sure, that's why we will test this)
DS> 3. if it still happens we will run with a normal (non-Xen) kernel, to
DS> exclude the Xen influence
DS> 4. last of all we will try without LVM (which would be harder)

DS> Regards
DS> P.S. sorry for my English :)


DS> Monday, July 23, 2007, 9:03:45 PM:

RN>> Doytchin Spiridonov wrote:
>>> Hello,
>>> 
>>> We are trying to identify a bug in Bacula and/or our system setup.
>>> 
>>> Has anyone seen errors like this on restore:
>>> 
>>> Error: attribs.c:410 File size of restored file
>>> /home/bacula/res/b3/usr/src/redhat/RPMS/i686/glibc-2.2.5-44.i686.rpm
>>> not correct. Original 3826291, restored 10620921.
>>> 
>>> - the file is not a log file or any other file that changed during the
>>> backup (in which case an error like the one above would be normal)
>>> 
>>> - the wrong file size is always larger than the original; if we truncate
>>> the restored file to the first N bytes, where N is the correct file size,
>>> the original and restored files match; we noted that the appended data
>>> is part of another file from the backup, not garbage data. Note that
>>> this other file (part of which was appended to the file with the wrong
>>> size) is itself restored correctly, so the only problem is a wrong
>>> file-size decision by Bacula and reading past the file's end (this seems
>>> to be some internal buffer of Bacula, since the data is stored in the
>>> volumes using GZIP and simply reading further in the volume would break
>>> everything - the appended data would then be garbage, not unzipped data).
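>>>
>>> (A quick way to see where the appended tail comes from: take the bytes
>>> past the original size from the restored file and search the other
>>> original files for them. A rough Python sketch; the three paths are
>>> passed on the command line and are only examples:)
>>>
>>>   # Rough sketch: grab a sample of the data appended past the original
>>>   # size and look for it inside the other files of the backed-up tree.
>>>   import os
>>>   import sys
>>>
>>>   original, restored, search_root = sys.argv[1], sys.argv[2], sys.argv[3]
>>>   with open(restored, "rb") as f:
>>>       f.seek(os.path.getsize(original))
>>>       tail = f.read(4096)            # a sample of the appended data
>>>   if not tail:
>>>       sys.exit("restored file is not larger than the original")
>>>
>>>   for dirpath, dirnames, filenames in os.walk(search_root):
>>>       for name in filenames:
>>>           path = os.path.join(dirpath, name)
>>>           try:
>>>               with open(path, "rb") as g:
>>>                   if tail in g.read():   # whole file in memory; fine for a test
>>>                       print("appended data found in:", path)
>>>           except OSError:
>>>               pass                       # skip unreadable files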

RN>> This has been brought up several times within the last week, but never
RN>> with this kind of explanation and examination. I wonder if some of the
RN>> others who have experienced it (I do not know their names -- hopefully
RN>> they can chime in) can do the same thing for us. This is potentially
RN>> serious, it seems, if it is a widespread problem.

RN>> I think if the others can verify it, this should also be copied to
RN>> Bacula devel. I think I will try a large restore of my own today to see
RN>> what happens.

RN>> Please give the rest of the details of your setup, however -- you don't
RN>> even include the Bacula version, and that is a very basic piece of
RN>> information. Operating system (presumably RedHat Linux from the file you
RN>> backed up, but who knows), architecture... all would be useful.

