[ https://issues.apache.org/jira/browse/KAFKA-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369935#comment-17369935 ]
HaiyuanZhao edited comment on KAFKA-12958 at 6/26/21, 6:29 PM:
---------------------------------------------------------------

Hi, [~jagsancio]

I added an invariant that notified leaders are never asked to load snapshots. However, the test case canRecoverAfterAllNodesKilled failed; the failure is easy to reproduce, and the details follow.

*New Invariant*

The verification logic is as follows:
{code:java}
// java
private static class LeaderNeverLoadSnapshot implements Invariant {
    final Cluster cluster;
    int epoch = 0;
    OptionalInt leaderId = OptionalInt.empty();

    private LeaderNeverLoadSnapshot(Cluster cluster) {
        this.cluster = cluster;
    }

    @Override
    public void verify() {
        for (RaftNode raftNode : cluster.running()) {
            if (raftNode.counter.isLeader()) {
                // A leader must never have been asked to load a snapshot.
                assertFalse(raftNode.counter.isHandleSnapshotCalled());
                assertTrue(raftNode.counter.getHandleSnapshotCalls() == 0);
            } else {
                // For a non-leader, the flag and the call count must agree.
                if (raftNode.counter.isHandleSnapshotCalled()) {
                    assertTrue(raftNode.counter.getHandleSnapshotCalls() > 0);
                } else {
                    assertTrue(raftNode.counter.getHandleSnapshotCalls() == 0);
                }
            }
        }
    }
}
{code}

*Run Result*

The root caller of handleSnapshot and its call stack are shown below. The call stack indicates that a new leader may catch up by loading a snapshot if its listener is lagging. The fireSnapshot call comes from KAFKA-12154 (revision 6203bf8). I am not sure whether this is expected. Could you please take a look?
{code:java}
// java
private void onUpdateLeaderHighWatermark(
    LeaderState<T> state,
    long currentTimeMs
) {
    state.highWatermark().ifPresent(highWatermark -> {
        ...
        // It is also possible that the high watermark is being updated
        // for the first time following the leader election, so we need
        // to give lagging listeners an opportunity to catch up as well
        updateListenersProgress(highWatermark.offset);
    });
}
{code}

!image-2021-06-27-02-27-41-966.png!


was (Author: zhaohaidao):
Hi, [~jagsancio]

I added an invariant that notified leaders are never asked to load snapshots.
However, the test case canRecoverAfterAllNodesKilled failed; the failure is easy to reproduce, and the details follow.

*New Invariant*

The verification logic is as follows:
{code:java}
// java
private static class LeaderNeverLoadSnapshot implements Invariant {
    final Cluster cluster;
    int epoch = 0;
    OptionalInt leaderId = OptionalInt.empty();

    private LeaderNeverLoadSnapshot(Cluster cluster) {
        this.cluster = cluster;
    }

    @Override
    public void verify() {
        for (RaftNode raftNode : cluster.running()) {
            if (raftNode.counter.isLeader()) {
                // A leader must never have been asked to load a snapshot.
                assertFalse(raftNode.counter.isHandleSnapshotCalled());
                assertTrue(raftNode.counter.getHandleSnapshotCalls() == 0);
            } else {
                // For a non-leader, the flag and the call count must agree.
                if (raftNode.counter.isHandleSnapshotCalled()) {
                    assertTrue(raftNode.counter.getHandleSnapshotCalls() > 0);
                } else {
                    assertTrue(raftNode.counter.getHandleSnapshotCalls() == 0);
                }
            }
        }
    }
}
{code}

*Run Result*

The handleSnapshot call stack is shown below. The call stack indicates that a new leader may catch up by loading a snapshot if its listener is lagging. The fireSnapshot call comes from KAFKA-12154 (revision 6203bf8). I am not sure whether this is expected. Could you please take a look?

!image-2021-06-27-02-27-41-966.png!


> Add simulation invariant for leadership and snapshot
> ----------------------------------------------------
>
>                 Key: KAFKA-12958
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12958
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Jose Armando Garcia Sancio
>            Assignee: HaiyuanZhao
>            Priority: Major
>         Attachments: image-2021-06-27-02-09-25-296.png,
> image-2021-06-27-02-15-23-760.png, image-2021-06-27-02-26-48-368.png,
> image-2021-06-27-02-27-41-966.png
>
>
> During the simulation we should add an invariant that notified leaders are
> never asked to load snapshots. The state machine always sees the following
> sequence of callback calls:
> Leaders see:
> ...
> handleLeaderChange state machine is notified of leadership
> handleSnapshot is never called
> Non-leaders see:
> ...
> handleLeaderChange state machine is notified that it is not the leader
> handleSnapshot is called 0 or more times



--
This message was sent by Atlassian Jira
(v8.3.4#803005)