Some updates on this issue. To recap: two nodes, primary and secondary; after an initial sync the secondary is rebooted, and a verify afterwards always detects out-of-sync sectors.
It also occurs on version 9.2.4 of the driver, as well as on all of the kernel versions we have been using, so it appears to be unrelated to any changes we have made in system startup.

We are working around this problem by invalidating the disk and doing a full resync after a reboot, but this is fairly onerous for large disks. We have not been able to confirm corruption when no connection is made back to another node after the reboot, though this is harder to validate, as the system may boot with corruption already present.

What expectations should we have for integrity on a shutdown? A reboot? A power loss? Where could we look to understand this issue more closely?

________________________________________
From: Tim Westbrook <tim_westbr...@selinc.com>
Sent: Tuesday, December 24, 2024 11:01 AM
To: drbd-user@lists.linbit.com <drbd-user@lists.linbit.com>
Subject: Verify consistently fails after rebooting secondary node

Hello,

We are observing the following issue with resync after reboot.

After rebooting a secondary node (in a 2- or 3-node cluster), the secondary successfully connects to the primary and reports UpToDate, but when a verify is launched on the rebooted secondary node, it reports out-of-sync blocks. If an "invalidate --reset-bitmap=no" is issued on the resource on the secondary node, the invalidate sync completes quickly and the next verify succeeds with no out-of-sync blocks.

This was initially detected when we promoted a backup node and it came up with disk corruption. We traced this back to a reboot that occurred before the promotion.

Versions

The attached logs are from version 9.2.12 of the driver on the 5.15.173 kernel, but we have also observed this issue with the 9.2.4 driver on the 5.15.166 kernel. We have not seen the problem on 5.15.151 with version 9.2.4 of the driver.
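For reference, the workaround sequence we run on the rebooted secondary looks roughly like the following. This is a sketch, not our exact script: it assumes the resource name "persist" from our configuration, and that the drbdadm 9.x syntax for verify and invalidate is available; it must be run against a live, connected DRBD resource.

```shell
# Sketch of the workaround on the rebooted secondary.
# Assumes resource "persist" (from our config); adjust as needed.

# 1. After the reboot, a verify reports out-of-sync blocks:
drbdadm verify persist
# (watch the kernel log or "drbdadm status persist" for the result)

# 2. Work around it: resync from the peer without resetting the bitmap:
drbdadm invalidate --reset-bitmap=no persist

# 3. A subsequent verify then reports no out-of-sync blocks:
drbdadm verify persist
```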
Attachments

initsyncandverify_noreboot.txt - drbd logs from the system prior to reboot; includes a verify before the reboot
verify_after_invalidate_no_reset.txt - drbd logs after reboot, showing the initial failed verify, then the invalidate, then a successful verify
dynamic.res - drbd conf file - note the use of a separate metadata disk - we also

Secondary Bring Up

Secondary nodes enable the drbd "persist" resource as follows:

"""
da up all || true
da secondary persist || true
da disconnect persist || true
da -- --discard-my-data connect persist || true
"""
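Since the promotion of a node that had merely reported UpToDate is what first exposed the corruption, one mitigation we are considering is gating promotion on a clean verify rather than on the UpToDate state alone. A minimal sketch, assuming "da" above is an alias for drbdadm, a resource named "persist", and that the "out-of-sync:" counter appears in drbdsetup's statistics output (field naming may vary between versions); in real use the verify completion would need to be polled rather than assumed:

```shell
#!/bin/sh
# Sketch: refuse to promote unless a verify finds zero out-of-sync blocks.
# Resource name "persist" is from our config; the "out-of-sync:" field
# name is an assumption about drbdsetup's statistics output.
RES=persist

drbdadm verify "$RES"
# ...wait here for the verify to finish (simplified; poll status in real use)...

oos=$(drbdsetup status "$RES" --statistics \
      | awk -F: '/out-of-sync/ {gsub(/ /,"",$2); print $2; exit}')

if [ "${oos:-0}" -eq 0 ]; then
    drbdadm primary "$RES"
else
    echo "refusing to promote: $oos out-of-sync on $RES" >&2
    exit 1
fi
```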