Jason Gustafson created KAFKA-12181:
---------------------------------------

             Summary: Loosen monotonic fetch offset validation by raft leader
                 Key: KAFKA-12181
                 URL: https://issues.apache.org/jira/browse/KAFKA-12181
             Project: Kafka
          Issue Type: Sub-task
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


Currently, in the Raft leader implementation, we validate that follower fetch 
offsets increase monotonically. This protects the guarantees that Raft provides, 
since a non-monotonic update means that the follower has lost data it had 
previously replicated, which may or may not amount to the loss of committed 
data. It depends on whether the update also causes a non-monotonic update to the 
high watermark. If the fetch is from an observer, no harm is done, since 
observers do not affect the high watermark. If the fetch is from a voter and a 
majority of the nodes (excluding the fetcher) have offsets larger than or equal 
to the high watermark, again no harm is done. It is easy to check for these 
cases and log a warning instead of raising an error.
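
As a rough sketch of the check described above (ReplicaState, 
cannotRegressHighWatermark, and onFetch are hypothetical names, not the actual 
classes or methods in the raft module), the leader could evaluate whether a 
non-monotonic fetch offset can affect the high watermark before choosing 
between a warning and an error:

{code:java}
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical sketch only; illustrative names, not Kafka's actual raft code.
public class LeaderFetchValidator {

    static class ReplicaState {
        final int replicaId;
        final boolean isVoter;
        long lastFetchOffset;  // latest offset the leader has seen from this replica

        ReplicaState(int replicaId, boolean isVoter, long lastFetchOffset) {
            this.replicaId = replicaId;
            this.isVoter = isVoter;
            this.lastFetchOffset = lastFetchOffset;
        }
    }

    // A non-monotonic fetch offset cannot regress the high watermark if the
    // fetcher is an observer, or if a majority of the voters (not counting the
    // fetcher) already have offsets at or beyond the high watermark.
    static boolean cannotRegressHighWatermark(ReplicaState fetcher,
                                              Map<Integer, ReplicaState> voters,
                                              OptionalLong highWatermark) {
        if (!fetcher.isVoter || !highWatermark.isPresent()) {
            return true;
        }
        long hw = highWatermark.getAsLong();
        long votersAtOrAboveHw = voters.values().stream()
            .filter(state -> state.replicaId != fetcher.replicaId)
            .filter(state -> state.lastFetchOffset >= hw)
            .count();
        // Majority of the full voter set, excluding the fetcher itself
        return votersAtOrAboveHw >= (voters.size() / 2) + 1;
    }

    static void onFetch(ReplicaState fetcher,
                        long fetchOffset,
                        Map<Integer, ReplicaState> voters,
                        OptionalLong highWatermark) {
        if (fetchOffset < fetcher.lastFetchOffset) {
            if (cannotRegressHighWatermark(fetcher, voters, highWatermark)) {
                System.err.println("Non-monotonic fetch offset " + fetchOffset
                    + " from replica " + fetcher.replicaId
                    + "; the high watermark is unaffected, so only warn");
            } else {
                // Current behavior: reject the update. This ticket proposes
                // logging and accepting it instead (see below).
                throw new IllegalStateException("Non-monotonic fetch offset "
                    + fetchOffset + " from voter " + fetcher.replicaId);
            }
        }
        fetcher.lastFetchOffset = fetchOffset;
    }
}
{code}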

The question, then, is what to do if we get a voter fetch which does cause the 
high watermark to regress. The problem is that there are scenarios where data 
loss may be unavoidable. For example, a follower's disk might become corrupt 
and ultimately get replaced. Often the damage is already done by the time we 
receive the Fetch request with the non-monotonic offset, so the stricter 
validation in fact just prevents recovery.

It's also worth noting that the stricter validation by the leader cannot be 
relied on to detect data loss. For example, a recovered voter might restart in 
the middle of an election, before there is any leader to observe its fetch 
offsets. There is no general way that I'm aware of to detect when a voter has 
lost previously committed data.

With all of this in mind, my conclusion is that it makes sense to loosen the 
validation of fetches. The leader can still ensure that its own high watermark 
does not go backwards, and we can still log a warning, but the validation should 
not prevent replicas from catching up after hard failures involving disk state 
loss.
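
A minimal sketch of that loosened behavior, again with hypothetical names: the 
leader recomputes the high watermark from the voter end offsets but clamps it 
so that it never moves backwards, logging a warning when a regression would 
otherwise occur.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; not Kafka's actual raft code.
public class HighWatermarkTracker {
    private long highWatermark = 0L;

    // Recompute the high watermark as the largest offset replicated to a
    // majority of voters, but never let it move backwards. A would-be
    // regression is logged rather than treated as an error, so a voter
    // recovering from disk state loss can still catch up.
    public long update(List<Long> voterEndOffsets) {
        List<Long> sorted = new ArrayList<>(voterEndOffsets);
        Collections.sort(sorted);
        // With offsets sorted ascending, the offset replicated by a majority
        // sits at index (N - 1) / 2.
        long candidate = sorted.get((sorted.size() - 1) / 2);
        if (candidate < highWatermark) {
            System.err.println("Computed high watermark " + candidate
                + " is below the current value " + highWatermark
                + "; keeping the current value");
        } else {
            highWatermark = candidate;
        }
        return highWatermark;
    }
}
{code}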



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
