Hi 

I am running into an issue with disconnect during initial sync not resuming 
sync after reconnecting

While using drbd 9.2.7, primary and two secondaries, the primary has existing 
data that I need to preserve

The mesh is such that all nodes can reach all other nodes

1. After adding the new nodes and performing an adjust on the primary
2. The remote nodes start to initially sync (this usually takes 12 minutes in 
this test scenario (32g disks))
3. After 5 minutes I disable the network interface on the primary (the 
secondaries can still see each other)
4. Wait 20 minutes or longer - length doesn't seem to matter as long as is 
longer than normal sync time
5. Re-enable, instantly backup nodes are both up to date.
6. Run data integrity check, it fails marking a lot of data as out of sync

This is very reproducible

I have attached my configuration file[1] and the kernel logs snippets from 
primary[2] and a backup [3]

Here is the result of verify

[ 3960.135562] drbd persist/0 drbd0 node-40274784292070177166: Online verify 
done (total 379 sec; paused 0 sec; 72328 K/sec)
[ 3960.135570] drbd persist/0 drbd0 node-40274784292070177166: Online verify 
found 203568 4k blocks out of sync!

Is DRBD resilient to disconnects during initial sync? Should we not sync two 
secondary nodes at once?


I see there are some sync fixes in the RC for 9.2.8  , But not clear these may 
be related to rdma, I'm using tcp


Cheers, 

Tim

// Attachments 
[1] config


global {
        usage-count no;
        disable-ip-verification;
}

resource persist {
  device  /dev/drbd0;
  disk    /dev/strup-vg/strup-persist;

  disk {
    resync-rate 400M;
  }

  net {
    after-sb-1pri discard-secondary;
    verify-alg md5;
  }

  meta-disk /dev/strup-vg/strup-metapersist;

  on node-6927108042395771210 {
    node-id 1;
    address 172.20.0.1:7789;
  }

  on node-40274784292070177166 {
    node-id 2;
    address 172.20.0.2:7789;
  }

  on node-1010560022655352503 {
    node-id 3;
    address 172.20.0.3:7789;
  }

  connection-mesh {
    hosts node-6927108042395771210 node-40274784292070177166 
node-1010560022655352503;
  }
}

[2] Primary Log on disconnect/reconnect

[  450.949393] drbd persist node-40274784292070177166: PingAck did not arrive 
in time.
[  450.949413] drbd persist node-40274784292070177166: conn( Connected -> 
NetworkFailure ) peer( Secondary -> Unknown )
[  450.949419] drbd persist/0 drbd0 node-40274784292070177166: repl( SyncSource 
-> Off )
[  450.949440] drbd persist node-1010560022655352503: PingAck did not arrive in 
time.
[  450.949457] drbd persist node-1010560022655352503: conn( Connected -> 
NetworkFailure ) peer( Secondary -> Unknown )
[  450.949462] drbd persist/0 drbd0 node-1010560022655352503: repl( SyncSource 
-> Off )
[  450.949756] drbd persist node-40274784292070177166: Terminating sender thread
[  450.949792] drbd persist node-1010560022655352503: Terminating sender thread
[  450.949826] drbd persist node-1010560022655352503: Starting sender thread 
(from drbd_r_persist [8657])
[  450.949852] drbd persist node-40274784292070177166: Starting sender thread 
(from drbd_r_persist [8581])
[  450.956448] drbd persist node-1010560022655352503: meta connection shut down 
by peer.
[  450.956514] drbd persist node-40274784292070177166: meta connection shut 
down by peer.
[  450.992921] drbd persist node-1010560022655352503: Connection closed
[  450.992942] drbd persist node-1010560022655352503: helper command: 
/sbin/drbdadm disconnected
[  450.993325] drbd persist node-40274784292070177166: Connection closed
[  450.993355] drbd persist node-40274784292070177166: helper command: 
/sbin/drbdadm disconnected
[  450.995126] drbd persist node-1010560022655352503: helper command: 
/sbin/drbdadm disconnected exit code 255 (0xfffffffe)
[  450.995183] drbd persist node-1010560022655352503: conn( NetworkFailure -> 
Unconnected ) [disconnected]
[  450.995202] drbd persist node-1010560022655352503: Restarting receiver thread
[  450.995214] drbd persist node-1010560022655352503: conn( Unconnected -> 
Connecting ) [connecting]
[  450.995255] drbd persist node-1010560022655352503: Configured local address 
not found, retrying every 10 sec, err=-99
[  450.995771] drbd persist node-40274784292070177166: helper command: 
/sbin/drbdadm disconnected exit code 255 (0xfffffffe)
[  450.995818] drbd persist node-40274784292070177166: conn( NetworkFailure -> 
Unconnected ) [disconnected]
[  450.995875] drbd persist node-40274784292070177166: Restarting receiver 
thread
[  450.995886] drbd persist node-40274784292070177166: conn( Unconnected -> 
Connecting ) [connecting]
[  450.995949] drbd persist node-40274784292070177166: Configured local address 
not found, retrying every 10 sec, err=-99
[ 2960.265825] drbd persist node-40274784292070177166: Handshake to peer 2 
successful: Agreed network protocol version 122
[ 2960.265831] drbd persist node-40274784292070177166: Feature flags enabled on 
protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
[ 2960.266566] drbd persist: Preparing cluster-wide state change 2999333550 
(1->2 499/145)
[ 2960.269601] drbd persist node-1010560022655352503: Handshake to peer 3 
successful: Agreed network protocol version 122
[ 2960.269604] drbd persist node-1010560022655352503: Feature flags enabled on 
protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
[ 2960.275585] drbd persist/0 drbd0 node-40274784292070177166: 
drbd_sync_handshake:
[ 2960.275589] drbd persist/0 drbd0 node-40274784292070177166: self 
BFF38DFE83BC7649:0000000000000004:0000000000000000:0000000000000000 
bits:1576288 flags:120
[ 2960.275593] drbd persist/0 drbd0 node-40274784292070177166: peer 
BEFE83A748FAC586:0000000000000000:D31B9930119D4DEC:BEFE83A748FAC586 
bits:1554102 flags:1124
[ 2960.275597] drbd persist/0 drbd0 node-40274784292070177166: 
uuid_compare()=source-copy-other-bitmap by rule=bitmap-self-other
[ 2960.275599] drbd persist/0 drbd0 node-40274784292070177166: Copying bitmap 
of peer node_id=0 (bitmap_index=2)
[ 2960.300345] drbd persist: State change 2999333550: primary_nodes=2, 
weak_nodes=FFFFFFFFFFFFFFF9
[ 2960.300353] drbd persist: Committing cluster-wide state change 2999333550 
(33ms)
[ 2960.300389] drbd persist node-40274784292070177166: conn( Connecting -> 
Connected ) peer( Unknown -> Secondary ) [connected]
[ 2960.300394] drbd persist/0 drbd0 node-40274784292070177166: repl( Off -> 
WFBitMapS ) [connected]
[ 2960.303497] drbd persist/0 drbd0 node-40274784292070177166: send bitmap 
stats [Bytes(packets)]: plain 0(0), RLE 1046(1), total 1046; compression: 99.9%
[ 2960.306045] drbd persist/0 drbd0 node-40274784292070177166: receive bitmap 
stats [Bytes(packets)]: plain 0(0), RLE 1046(1), total 1046; compression: 99.9%
[ 2960.306058] drbd persist/0 drbd0 node-40274784292070177166: helper command: 
/sbin/drbdadm before-resync-source
[ 2960.306257] drbd persist/0 drbd0 node-40274784292070177166: helper command: 
/sbin/drbdadm before-resync-source exit code 255 (0xfffffffe)
[ 2960.306279] drbd persist/0 drbd0 node-40274784292070177166: repl( WFBitMapS 
-> SyncSource ) [receive-bitmap]
[ 2960.306349] drbd persist/0 drbd0 node-40274784292070177166: Began resync as 
SyncSource (will sync 244744 KB [61186 bits set]).
[ 2960.321526] drbd persist: Preparing cluster-wide state change 1704893824 
(1->3 499/145)
[ 2960.330564] drbd persist/0 drbd0 node-1010560022655352503: 
drbd_sync_handshake:
[ 2960.330569] drbd persist/0 drbd0 node-1010560022655352503: self 
BFF38DFE83BC7649:0000000000000004:D31B9930119D4DEC:0000000000000000 
bits:1912416 flags:120
[ 2960.330574] drbd persist/0 drbd0 node-1010560022655352503: peer 
0000000000000004:0000000000000000:BEFE83A748FAC586:0000000000000000 
bits:1890230 flags:1124
[ 2960.330579] drbd persist/0 drbd0 node-1010560022655352503: 
uuid_compare()=source-set-bitmap by rule=just-created-peer
[ 2960.330582] drbd persist/0 drbd0 node-1010560022655352503: Setting and 
writing one bitmap slot, after drbd_sync_handshake
[ 2960.331946] drbd persist/0 drbd0: FIXME kworker/1:0[21] op clear, bitmap 
locked for 'set_n_write sync_handshake' by drbd_r_persist[8657]
[ 2960.334034] drbd persist/0 drbd0: bitmap page idx 52 changed during IO!
[ 2960.334040] drbd persist/0 drbd0: bitmap page idx 53 changed during IO!
[ 2960.334043] drbd persist/0 drbd0: bitmap page idx 55 changed during IO!
[ 2960.335522] drbd persist/0 drbd0 node-1010560022655352503: updated bitmap 
UUID BFF38DFE83BC7649:0000000000000004:D31B9930119D4DEC:0000000000000000
[ 2960.337827] drbd persist: State change 1704893824: primary_nodes=2, 
weak_nodes=FFFFFFFFFFFFFFF1
[ 2960.337834] drbd persist: Committing cluster-wide state change 1704893824 
(16ms)
[ 2960.341188] drbd persist node-1010560022655352503: conn( Connecting -> 
Connected ) peer( Unknown -> Secondary ) [connected]
[ 2960.341192] drbd persist/0 drbd0 node-1010560022655352503: repl( Off -> 
WFBitMapS ) [connected]
[ 2960.343712] drbd persist/0 drbd0 node-1010560022655352503: send bitmap stats 
[Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
[ 2960.351952] drbd persist/0 drbd0 node-1010560022655352503: receive bitmap 
stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
[ 2960.351968] drbd persist/0 drbd0 node-1010560022655352503: helper command: 
/sbin/drbdadm before-resync-source
[ 2960.353490] drbd persist/0 drbd0 node-1010560022655352503: helper command: 
/sbin/drbdadm before-resync-source exit code 255 (0xfffffffe)
[ 2960.353558] drbd persist/0 drbd0 node-1010560022655352503: repl( WFBitMapS 
-> SyncSource ) [receive-bitmap]
[ 2960.353687] drbd persist/0 drbd0 node-1010560022655352503: Began resync as 
SyncSource (will sync 26599424 KB [6649856 bits set]).
[ 2963.145539] drbd persist/0 drbd0 node-40274784292070177166: updated UUIDs 
BFF38DFE83BC7649:0000000000000000:D31B9930119D4DEC:0000000000000000
[ 2963.150900] drbd persist/0 drbd0 node-40274784292070177166: Resync done 
(total 2 sec; paused 0 sec; 122372 K/sec)
[ 2963.150913] drbd persist/0 drbd0 node-40274784292070177166: pdsk( 
Inconsistent -> UpToDate ) repl( SyncSource -> Established ) [resync-finished]
[ 3004.217498] drbd persist/0 drbd0 node-1010560022655352503: updated UUIDs 
BFF38DFE83BC7649:0000000000000000:D31B9930119D4DEC:0000000000000000
[ 3004.228794] drbd persist/0 drbd0 node-1010560022655352503: Resync done 
(total 43 sec; paused 0 sec; 618588 K/sec)
[ 3004.228806] drbd persist/0 drbd0 node-1010560022655352503: pdsk( 
Inconsistent -> UpToDate ) repl( SyncSource -> Established ) [resync-finished]

[3] Backup log on disconnect/reconnect
[  295.174846] drbd persist node-6927108042395771210: Connection closed
[  295.174863] drbd persist node-6927108042395771210: helper command: 
/sbin/drbdadm disconnected
[  295.175088] drbd persist node-6927108042395771210: helper command: 
/sbin/drbdadm disconnected exit code 255 (0xfffffffe)
[  295.175113] drbd persist node-6927108042395771210: conn( NetworkFailure -> 
Unconnected ) [disconnected]
[  295.175131] drbd persist node-6927108042395771210: Restarting receiver thread
[  295.175142] drbd persist node-6927108042395771210: conn( Unconnected -> 
Connecting ) [connecting]
[ 2804.185311] drbd persist node-40274784292070177166: Preparing remote state 
change 2999333550
[ 2804.188050] drbd persist node-6927108042395771210: Handshake to peer 1 
successful: Agreed network protocol version 122
[ 2804.188059] drbd persist node-6927108042395771210: Feature flags enabled on 
protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
[ 2804.219640] drbd persist node-40274784292070177166: Committing remote state 
change 2999333550 (primary_nodes=2)
[ 2804.224622] drbd persist/0 drbd0 node-40274784292070177166: resync-susp( no 
-> peer ) [peer-state]
[ 2804.243564] drbd persist node-6927108042395771210: Preparing remote state 
change 1704893824
[ 2804.248755] drbd persist/0 drbd0 node-6927108042395771210: 
drbd_sync_handshake:
[ 2804.248761] drbd persist/0 drbd0 node-6927108042395771210: self 
0000000000000004:0000000000000000:BEFE83A748FAC586:0000000000000000 
bits:1890230 flags:124
[ 2804.248769] drbd persist/0 drbd0 node-6927108042395771210: peer 
BFF38DFE83BC7649:0000000000000004:0000000000000000:0000000000000000 
bits:1912416 flags:1120
[ 2804.248777] drbd persist/0 drbd0 node-6927108042395771210: 
uuid_compare()=target-set-bitmap by rule=just-created-self
[ 2804.248783] drbd persist/0 drbd0 node-6927108042395771210: Setting and 
writing the whole bitmap, fresh node
[ 2804.259577] drbd persist node-6927108042395771210: Committing remote state 
change 1704893824 (primary_nodes=2)
[ 2804.259598] drbd persist node-6927108042395771210: conn( Connecting -> 
Connected ) peer( Unknown -> Primary ) [remote]
[ 2804.259607] drbd persist/0 drbd0 node-6927108042395771210: pdsk( DUnknown -> 
UpToDate ) repl( Off -> WFBitMapT ) [remote]
[ 2804.263408] drbd persist/0 drbd0 node-6927108042395771210: receive bitmap 
stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
[ 2804.268930] drbd persist/0 drbd0 node-6927108042395771210: send bitmap stats 
[Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
[ 2804.268954] drbd persist/0 drbd0 node-6927108042395771210: helper command: 
/sbin/drbdadm before-resync-target
[ 2804.269181] drbd persist/0 drbd0 node-6927108042395771210: helper command: 
/sbin/drbdadm before-resync-target exit code 255 (0xfffffffe)
[ 2804.269194] drbd persist/0 drbd0 node-6927108042395771210: setting UUIDs to 
0000000000000004:0000000000000000:BEFE83A748FAC586:0000000000000000
[ 2804.269211] drbd persist/0 drbd0 node-6927108042395771210: repl( WFBitMapT 
-> SyncTarget ) [receive-bitmap]
[ 2804.269272] drbd persist/0 drbd0 node-6927108042395771210: Began resync as 
SyncTarget (will sync 26599424 KB [6649856 bits set]).
[ 2807.061425] drbd persist/0 drbd0 node-40274784292070177166: pdsk( 
Inconsistent -> UpToDate ) [peer-state]
[ 2807.061442] drbd persist/0 drbd0: Multiple UpToDate peers have different 
current UUIDs
[ 2807.061448] drbd persist/0 drbd0: node-40274784292070177166: 
BEFE83A748FAC586 node-6927108042395771210: BFF38DFE83BC7649
[ 2807.061453] drbd persist/0 drbd0: Setting exposed data uuid: BEFE83A748FAC586
[ 2807.061535] drbd persist/0 drbd0: Multiple UpToDate peers have different 
current UUIDs
[ 2807.061541] drbd persist/0 drbd0: node-40274784292070177166: 
BEFE83A748FAC586 node-6927108042395771210: BFF38DFE83BC7649
[ 2807.064043] drbd persist/0 drbd0 node-40274784292070177166: updated UUIDs 
0000000000000004:0000000000000000:BEFE83A748FAC586:0000000000000000
[ 2807.069167] drbd persist/0 drbd0: Setting exposed data uuid: BFF38DFE83BC7648
[ 2848.135754] drbd persist/0 drbd0 node-6927108042395771210: Resync done 
(total 43 sec; paused 0 sec; 618588 K/sec)
[ 2848.135776] drbd persist/0 drbd0 node-6927108042395771210: updated UUIDs 
BFF38DFE83BC7648:0000000000000000:BEFE83A748FAC586:0000000000000000
[ 2848.135801] drbd persist/0 drbd0: disk( Inconsistent -> UpToDate ) 
[resync-finished]
[ 2848.135809] drbd persist/0 drbd0 node-40274784292070177166: resync-susp( 
connection dependency -> no ) [resync-finished]
[ 2848.135818] drbd persist/0 drbd0 node-6927108042395771210: repl( SyncTarget 
-> Established ) [resync-finished]
[ 2848.136253] drbd persist/0 drbd0 node-6927108042395771210: helper command: 
/sbin/drbdadm after-resync-target
[ 2848.136483] drbd persist/0 drbd0 node-6927108042395771210: helper command: 
/sbin/drbdadm after-resync-target exit code 255 (0xfffffffe)


Reply via email to