We have some jobs that back up SQL databases; immediately after each backup completes, an "AfterJob" script restores it to a test system, and then we run a DiskToCatalog verify job. However, the restore fails with a block checksum mismatch, while the verify job that runs immediately afterwards succeeds.
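For reference, the job chain looks roughly like this in our Director config (names are taken from the logs below; the script paths and exact directives are simplified from our real setup):

```conf
Job {
  Name = "test.domain.com-sql"
  Type = Backup
  Client = "test.domain.com-fd"
  FileSet = "test-sql"
  Pool = "File"
  # Dump the databases on the client before the backup starts
  ClientRunBeforeJob = "scripts/postgresql_dump.sh"
  # Kick off the test restore on the Director once the backup finishes
  RunScript {
    RunsWhen = After
    RunsOnClient = No
    Command = "scripts/restore_by_jobid test.domain.com-fd %i"   # %i = JobId
  }
}

Job {
  Name = "verify-test.domain.com-sql"
  Type = Verify
  Level = DiskToCatalog
  Client = "test.domain.com-fd"
  FileSet = "test-sql"
  VerifyJob = "test.domain.com-sql"
}
```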
The backup job seems to succeed without any errors:
------------------------------------
07-May 21:05 server-dir JobId 206335: Start Backup JobId 206335, Job=test.domain.com-sql.2019-05-07_21.05.01_27
07-May 21:05 server-dir JobId 206335: Using Device "FileStorage" to write.
07-May 21:05 test.domain.com-fd JobId 206335: shell command: run ClientRunBeforeJob "scripts/postgresql_dump.sh"
07-May 21:15 server-sd JobId 206335: Elapsed time=00:04:22, Transfer rate=13.49 M Bytes/second
07-May 21:15 server-sd JobId 206335: Sending spooled attrs to the Director. Despooling 4,073 bytes ...
07-May 21:15 server-dir JobId 206335: Bacula server-dir 9.0.6 (20Nov17):
  Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
  JobId:                  206335
  Job:                    test.domain.com-sql.2019-05-07_21.05.01_27
  Backup Level:           Incremental, since=2019-05-06 21:11:28
  Client:                 "test.domain.com-fd" 5.2.6 (21Feb12) x86_64-pc-linux-gnu,debian,jessie/sid
  FileSet:                "test-sql" 2017-08-30 21:05:00
  Pool:                   "File" (From Job resource)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "File" (From Pool resource)
  Scheduled time:         07-May-2019 21:05:01
  Start time:             07-May-2019 21:11:00
  End time:               07-May-2019 21:15:24
  Elapsed time:           4 mins 24 secs
  Priority:               500
  FD Files Written:       15
  SD Files Written:       15
  FD Bytes Written:       3,535,794,319 (3.535 GB)
  SD Bytes Written:       3,535,796,681 (3.535 GB)
  Rate:                   13393.2 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):         Vol0043
  Volume Session Id:      4
  Volume Session Time:    1557261339
  Last Volume Bytes:      26,223,400,919 (26.22 GB)
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Backup OK

07-May 21:15 server-dir JobId 206335: Begin pruning Jobs older than 2 months .
07-May 21:15 server-dir JobId 206335: No Jobs found to prune.
07-May 21:15 server-dir JobId 206335: Begin pruning Files.
07-May 21:15 server-dir JobId 206335: No Files found to prune.
07-May 21:15 server-dir JobId 206335: End auto prune.
07-May 21:15 server-dir JobId 206335: shell command: run AfterJob "scripts/restore_by_jobid test.domain.com-fd 206335"
------------------------------------

Soon after, the restore job fails with this log:
------------------------------------
08-May 00:13 server-dir JobId 206424: Start Restore Job TestRestoreProductionSQL.2019-05-07_21.15.24_57
08-May 00:13 server-dir JobId 206424: Using Device "FileStorage" to read.
08-May 00:13 server-sd JobId 206424: Ready to read from volume "Vol0043" on File device "FileStorage" (/home/backup).
08-May 00:13 server-sd JobId 206424: Forward spacing Volume "Vol0043" to addr=22681860551
08-May 00:13 server-sd JobId 206424: Error: block_util.c:470 Volume data error at 0:0! Block checksum mismatch in block=1723 len=64512: calc=24873f8f blk=2deb5777
08-May 00:13 server-sd JobId 206424: Elapsed time=00:00:01, Transfer rate=111.0 M Bytes/second
08-May 00:13 server-sd JobId 206424: Fatal error: fd_cmds.c:223 Command error with FD msg="rechdr 4 1557261339 2 2 65536", SD hanging up.
08-May 00:13 test.domain.com-fd JobId 206424: Error: attribs.c:558 File size of restored file /var/lib/postgresql/tmp/test-db_postgresql_dump.sql.tar.gz not correct. Original 464348732, restored 111017984.
08-May 00:13 server-dir JobId 206424: Error: Bacula server-dir 9.0.6 (20Nov17):
  Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
  JobId:                  206424
  Job:                    TestRestoreProductionSQL.2019-05-07_21.15.24_57
  Restore Client:         test.domain.com-fd
  Start time:             08-May-2019 00:13:33
  End time:               08-May-2019 00:13:35
  Files Expected:         15
  Files Restored:         2
  Bytes Restored:         111,018,596
  Rate:                   55509.3 KB/s
  FD Errors:              1
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Restore Error ***
------------------------------------

Here is the log from the verify job, which succeeds 3 minutes after the failed restore (though it takes only 16 seconds, which seems too fast):
------------------------------------
08-May 00:16 server-dir JobId 206428: Verifying against JobId=206335 Job=test.domain.com-sql.2019-05-07_21.05.01_27
08-May 00:16 server-dir JobId 206428: Start Verify JobId=206428 Level=DiskToCatalog Job=verify-test.domain.com-sql.2019-05-07_22.20.00_32
08-May 00:17 server-dir JobId 206428: Bacula server-dir 9.0.6 (20Nov17):
  Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
  JobId:                  206428
  Job:                    verify-test.domain.com-sql.2019-05-07_22.20.00_32
  FileSet:                test-sql
  Verify Level:           DiskToCatalog
  Client:                 test.domain.com-fd
  Verify JobId:           206335
  Verify Job:
  Start time:             08-May-2019 00:16:47
  End time:               08-May-2019 00:17:03
  Elapsed time:           16 secs
  Files Examined:         16
  Non-fatal FD errors:    0
  FD termination status:  OK
  Termination:            Verify OK

08-May 00:17 server-dir JobId 206428: Begin pruning Jobs older than 2 months .
08-May 00:17 server-dir JobId 206428: No Jobs found to prune.
08-May 00:17 server-dir JobId 206428: Begin pruning Files.
08-May 00:17 server-dir JobId 206428: No Files found to prune.
08-May 00:17 server-dir JobId 206428: End auto prune.
------------------------------------

Clearly the verify job either isn't verifying anything or isn't verifying what I expect it to verify.
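For anyone wondering what the "Block checksum mismatch" message means mechanically: each block written to a volume carries a 32-bit checksum in its header, and on read the storage daemon recomputes it over the block it just read and compares it with the stored value (the calc= vs blk= pair in the error). A minimal Python sketch of that kind of check follows; it uses zlib's CRC-32 and a simplified header purely for illustration, not Bacula's actual bcrc32 routine or BB02 block layout:

```python
import struct
import zlib


def write_block(payload: bytes) -> bytes:
    """Prepend a 4-byte checksum, the way a volume block stores one
    (layout simplified; not Bacula's real block format)."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def read_block(block: bytes) -> bytes:
    """Recompute the checksum on read and fail on mismatch,
    analogous to what the SD reports from block_util.c."""
    (stored,) = struct.unpack(">I", block[:4])
    payload = block[4:]
    calc = zlib.crc32(payload)
    if calc != stored:
        raise IOError(f"Block checksum mismatch: calc={calc:08x} blk={stored:08x}")
    return payload


good = write_block(b"x" * 64512)             # same block length as in the log
assert read_block(good) == b"x" * 64512      # an intact block reads back fine

corrupt = good[:100] + b"\x00" + good[101:]  # flip one byte after the write
try:
    read_block(corrupt)
except IOError as e:
    print(e)
```

The point of the sketch is that the checksum is computed at write time and re-checked at read time, so a mismatch means the bytes changed somewhere in between (disk, filesystem, RAM, or in flight), not necessarily at rest on disk.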
I noticed in the v7.4.0 release notes that there was a new "Level=Data" feature for verify jobs, but I couldn't find any further information on it in the documentation, only descriptions of the "DiskToCatalog" and "VolumeToCatalog" verify levels. Does "Level=Data" actually exist, or is there additional documentation on it somewhere that I'm missing? This is all critical data, so we want to verify it as thoroughly as possible, ideally something that mimics everything a restore would do, so we can be certain the data was intact at that time.

The next question is why the data is being corrupted in the first place. The Bacula server stores the pool volumes on a RAID1 BTRFS array, and a BTRFS scrub of the entire block device shows no checksum errors whatsoever. There are also no hardware errors in the dmesg logs, and SMART monitoring shows nothing on the drives (which are only a few months old). Any ideas what would cause Bacula to report checksum errors when BTRFS doesn't?

The SQL backup jobs run across multiple servers in multiple countries, and so far they are the only jobs on which we have seen these checksum errors. It happens almost every night when the jobs run on schedule, yet if we run the same jobs manually the next morning everything works and the restore succeeds without a problem. Any insights would be greatly appreciated.
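In case it helps anyone reproduce this, the checks described above can be bundled into one sweep that runs right after a failed restore instead of the next morning. This is only a sketch: the paths and the "FileStorage" device name come from the logs above, the drive name is a placeholder, and the exact `bls` invocation (e.g. whether you need `-c` to point at your SD config) varies by setup:

```shell
#!/bin/sh
# Hypothetical post-failure diagnostic sweep; adjust names for your setup.
POOL_DIR=/home/backup      # where the File volumes live (from the log)
VOLUME=Vol0043             # the volume the restore failed on

# 1. Foreground BTRFS scrub of the filesystem holding the volumes
command -v btrfs >/dev/null && btrfs scrub start -B "$POOL_DIR"

# 2. Any kernel-level I/O errors since boot?
dmesg 2>/dev/null | grep -iE 'i/o error|ata error' || echo "no kernel I/O errors found"

# 3. SMART health summary (drive name is an assumption)
command -v smartctl >/dev/null && smartctl -H /dev/sda

# 4. Walk the volume with bls, independently of any restore job; it reads
#    through the volume's blocks, so it should hit the same bad block.
command -v bls >/dev/null && bls -V "$VOLUME" FileStorage

echo "diagnostic sweep finished"
```

If `bls` trips over the same block while BTRFS scrub stays clean, that would at least narrow it down to the data having been written corrupt rather than rotting on disk.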
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users