Hi all,

Here at Dropbox we're still (off and on) trying to get to the bottom of the data loss that's been hitting our largest cluster during preferred replica elections. It unfortunately has repeated a few times, so now we have a question.

To make sure we're understanding this correctly: message commit status (replication) basically goes through three phases:

* Messages Uncommitted: the leader has received new production, and the followers receive the messages in a Fetch request. Everybody's LEO is incremented, but HWMs are unchanged. No messages are considered committed.

* Leader Committed: in the subsequent Fetch, the follower says "my LEO is now X", and the leader records that and then updates its HWM to be the minimum of all followers' LEOs. After this stage, the messages are committed, but only on the leader. The follower's HWM is still in the past.

* Leader+Follower Committed (or HWM Replication Fetch): in this final fetch, the follower is informed of the new HWM by the leader and increments its own HWM accordingly.

These aren't actually distinct phases, of course; they're just part of individual fetch requests, but I think logically the HWM transitions can be thought of in that way, i.e., "message uncommitted", "leader committed", "leader+follower committed".

Is this understanding accurate?

If so, is it just a fact that (right now) a preferred replica election has a small chance of electing a follower and causing loss of any "leader committed" messages (i.e., messages that are not considered committed on the follower that is now getting promoted)? We can't find anything in the protocol that would guard against this.

I've also been reading KIP-101, and it looks like this is sort of what Scenario 1 refers to. However, that scenario mentions broker failure, and my concern is that data loss is possible even in the normal scenario with no broker failures.

Any thoughts?

-- 
Mark Smith
m...@qq.is
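
P.S. To make the three phases concrete, here is a minimal toy sketch of the LEO/HWM bookkeeping as I described it above. It is not Kafka's code: the names `Replica`, `fetch`, and `leader_only_committed` are made up for illustration, there is only one follower, and it assumes the leader builds the fetch response (including the HWM it reports) before it advances its own HWM, which is what produces the one-fetch lag in my description.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica: the log is a list, so the LEO is just len(log)."""
    name: str
    log: list = field(default_factory=list)
    hwm: int = 0

    @property
    def leo(self) -> int:
        return len(self.log)


def fetch(leader: Replica, follower: Replica) -> None:
    """One fetch round between a leader and its single follower (toy model)."""
    # Assumption: the response is built first, so it carries the HWM
    # as it was before this fetch was accounted for.
    new_messages = leader.log[follower.leo:]
    response_hwm = leader.hwm
    # The fetch offset tells the leader the follower's LEO; with one
    # follower, the leader's HWM becomes the minimum of the two LEOs.
    leader.hwm = max(leader.hwm, min(leader.leo, follower.leo))
    # The follower appends the messages and adopts the reported HWM.
    follower.log.extend(new_messages)
    follower.hwm = max(follower.hwm, min(response_hwm, follower.leo))


def leader_only_committed(leader: Replica, follower: Replica) -> list:
    """Offsets committed on the leader but not yet on the follower."""
    return list(range(follower.hwm, leader.hwm))


leader = Replica("leader", log=["m0", "m1"])  # two freshly produced messages
follower = Replica("follower")

for phase in ("uncommitted", "leader committed", "leader+follower committed"):
    fetch(leader, follower)
    print(phase,
          "leader (LEO, HWM):", (leader.leo, leader.hwm),
          "follower (LEO, HWM):", (follower.leo, follower.hwm),
          "leader-only committed offsets:", leader_only_committed(leader, follower))
```

After the second fetch in this model, offsets 0 and 1 are committed on the leader while the follower's HWM is still 0. That is the window I'm asking about: whether a preferred replica election landing there can lose those messages.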