We have some jobs that backup SQL databases, so immediately after the
backup is complete we have a "AfterJob" that restores it to a test system,
then we run a DiskToCatalog verify job. However the restore fails with a
Block checksum mismatch and immediately after the verify job succeeds.

The backup job seems to succeed without any errors:
------------------------------------
07-May 21:05 server-dir JobId 206335: Start Backup JobId 206335,
Job=test.domain.com-sql.2019-05-07_21.05.01_27
07-May 21:05 server-dir JobId 206335: Using Device "FileStorage" to write.
07-May 21:05 test.domain.com-fd JobId 206335: shell command: run
ClientRunBeforeJob "scripts/postgresql_dump.sh"
07-May 21:15 server-sd JobId 206335: Elapsed time=00:04:22, Transfer
rate=13.49 M Bytes/second
07-May 21:15 server-sd JobId 206335: Sending spooled attrs to the Director.
Despooling 4,073 bytes ...
07-May 21:15 server-dir JobId 206335: Bacula server-dir 9.0.6 (20Nov17):
  Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
  JobId:                  206335
  Job:                    test.domain.com-sql.2019-05-07_21.05.01_27
  Backup Level:           Incremental, since=2019-05-06 21:11:28
  Client:                 "test.domain.com-fd" 5.2.6 (21Feb12)
x86_64-pc-linux-gnu,debian,jessie/sid
  FileSet:                "test-sql" 2017-08-30 21:05:00
  Pool:                   "File" (From Job resource)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "File" (From Pool resource)
  Scheduled time:         07-May-2019 21:05:01
  Start time:             07-May-2019 21:11:00
  End time:               07-May-2019 21:15:24
  Elapsed time:           4 mins 24 secs
  Priority:               500
  FD Files Written:       15
  SD Files Written:       15
  FD Bytes Written:       3,535,794,319 (3.535 GB)
  SD Bytes Written:       3,535,796,681 (3.535 GB)
  Rate:                   13393.2 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):         Vol0043
  Volume Session Id:      4
  Volume Session Time:    1557261339
  Last Volume Bytes:      26,223,400,919 (26.22 GB)
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Backup OK

07-May 21:15 server-dir JobId 206335: Begin pruning Jobs older than 2
months .
07-May 21:15 server-dir JobId 206335: No Jobs found to prune.
07-May 21:15 server-dir JobId 206335: Begin pruning Files.
07-May 21:15 server-dir JobId 206335: No Files found to prune.
07-May 21:15 server-dir JobId 206335: End auto prune.

07-May 21:15 server-dir JobId 206335: shell command: run AfterJob
"scripts/restore_by_jobid test.domain.com-fd 206335"
------------------------------------


The restore job soon after fails with this log:
------------------------------------
08-May 00:13 server-dir JobId 206424: Start Restore Job
TestRestoreProductionSQL.2019-05-07_21.15.24_57
08-May 00:13 server-dir JobId 206424: Using Device "FileStorage" to read.
08-May 00:13 server-sd JobId 206424: Ready to read from volume "Vol0043" on
File device "FileStorage" (/home/backup).
08-May 00:13 server-sd JobId 206424: Forward spacing Volume "Vol0043" to
addr=22681860551
08-May 00:13 server-sd JobId 206424: Error: block_util.c:470 Volume data
error at 0:0!
* Block checksum mismatch *in block=1723 len=64512: calc=24873f8f
blk=2deb5777
08-May 00:13 server-sd JobId 206424: Elapsed time=00:00:01, Transfer
rate=111.0 M Bytes/second
08-May 00:13 server-sd JobId 206424: Fatal error: fd_cmds.c:223 Command
error with FD msg="rechdr 4 1557261339 2 2 65536", SD hanging up.
08-May 00:13 test.domain.com-fd JobId 206424: Error: attribs.c:558 File
size of restored file
/var/lib/postgresql/tmp/test-db_postgresql_dump.sql.tar.gz not correct.
Original 464348732, restored 111017984.
08-May 00:13 server-dir JobId 206424: Error: Bacula server-dir 9.0.6
(20Nov17):
08-May 00:13 server-dir JobId 206424: Error: Bacula server-dir 9.0.6
(20Nov17):
  Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
  JobId:                  206424
  Job:                    TestRestoreProductionSQL.2019-05-07_21.15.24_57
  Restore Client:         test.domain.com-fd
  Start time:             08-May-2019 00:13:33
  End time:               08-May-2019 00:13:35
  Files Expected:         15
  Files Restored:         2
  Bytes Restored:         111,018,596
  Rate:                   55509.3 KB/s
  FD Errors:              1
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Restore Error ***
------------------------------------

Here is the log from the successful verify job 3 minutes after the failed
restore job (though it only takes 16 seconds which seems too fast):
------------------------------------
08-May 00:16 server-dir JobId 206428: Verifying against JobId=206335
Job=test.domain.com-sql.2019-05-07_21.05.01_27
08-May 00:16 server-dir JobId 206428: Start Verify JobId=206428
Level=DiskToCatalog Job=verify-test.domain.com-sql.2019-05-07_22.20.00_32
08-May 00:17 server-dir JobId 206428: Bacula server-dir 9.0.6 (20Nov17):
  Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
  JobId:                  206428
  Job:                    verify-test.domain.com-sql.2019-05-07_22.20.00_32
  FileSet:                test-sql
  Verify Level:           DiskToCatalog
  Client:                 test.domain.com-fd
  Verify JobId:           206335
  Verify Job:
  Start time:             08-May-2019 00:16:47
  End time:               08-May-2019 00:17:03
  Elapsed time:           16 secs
  Files Examined:         16
  Non-fatal FD errors:    0
  FD termination status:  OK
  Termination:            Verify OK

08-May 00:17 server-dir JobId 206428: Begin pruning Jobs older than 2
months .
08-May 00:17 server-dir JobId 206428: No Jobs found to prune.
08-May 00:17 server-dir JobId 206428: Begin pruning Files.
08-May 00:17 server-dir JobId 206428: No Files found to prune.
08-May 00:17 server-dir JobId 206428: End auto prune.
------------------------------------

Clearly the verify job either isn't verifying anything or isn't verifying
what I expect it to be verifying. I noticed in the v7.4.0 release notes
there was a new feature of "Level=Data" for verification jobs, but I
couldn't actually find any more information on that in the documentation,
just information about "DiskToCatalog" and "VolumeToCatalog" verification
levels. Does "Level=Data" actually exist, or is there additional
documentation on it somewhere that I'm missing? This is all critical data,
so we want to do as much verification as possible on it which should in
theory mimic everything a restore would do so we can be 100% certain the
data was intact at that time.

The next question is why is the data being corrupted to begin with? The
bacula server uses a RAID1 BTRFS array to store the pool volumes on  and
doing a BTRFS scrub on the entire block device shows no checksum errors
whatsoever. There are also no hardware errors appearing in the dmesg logs,
and SMART monitoring on the drives isn't showing any errors (drives are a
few months old).

Any ideas what would cause bacula to show checksum errors but BTRFS isn't?

The SQL backup jobs are being run across multiple servers in multiple
countries, and they seem to be the only jobs that we have experienced the
checksum errors with so far and its happening almost every night when the
jobs are scheduled to run. If we run the jobs manually the next morning
everything works fine and the restore succeeds without a problem.

Any insights would be greatly appreciated.
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Reply via email to