Hi Lucas,

Yes, the case you mentioned is true. I understand that KIP-501 might not fully solve the particular case where fetch requests are blocked in the request queue. But the issue we have noticed multiple times, and continue to notice, is:

1. A fetch request comes in from a follower.
2. The leader tries to read the data from disk, and the read takes longer than replica.lag.time.max.ms.
3. The async thread on the leader side that checks the ISR marks the follower that sent the fetch request as out of the ISR.
4. The leader then dies during this request due to disk errors, and we end up with offline partitions because the leader kicked healthy followers out of the ISR.
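To make the timing concrete, here is a rough sketch of how we understand today's shrink decision (the names are simplified placeholders, not the actual Partition/Replica code): the follower is judged only on when it last caught up, and that timestamp cannot advance while the leader itself is blocked reading from disk.

object IsrShrinkToday {
  // Simplified sketch of today's shrink check (hypothetical names, not the real broker code).
  // lastCaughtUpTimeMs only advances after the leader has served a fetch and observed the
  // follower at the leader's log end offset, so a slow disk read on the leader freezes it
  // even though the follower keeps sending fetch requests.
  case class FollowerState(lastCaughtUpTimeMs: Long)

  def isOutOfSync(follower: FollowerState, nowMs: Long, replicaLagTimeMaxMs: Long): Boolean =
    nowMs - follower.lastCaughtUpTimeMs > replicaLagTimeMaxMs

  def main(args: Array[String]): Unit = {
    val replicaLagTimeMaxMs = 30000L // replica.lag.time.max.ms value used for this example
    val follower = FollowerState(lastCaughtUpTimeMs = 0L)
    // Leader is stuck on a bad disk for 35s; the follower's fetch sits blocked the whole
    // time, lastCaughtUpTimeMs never moves, and the healthy follower is dropped from the ISR.
    println(isOutOfSync(follower, nowMs = 35000L, replicaLagTimeMaxMs)) // true -> kicked out
  }
}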
Instead of looking at this as a disk issue, let's look at how we maintain the ISR:

1. Currently we do not consider a follower healthy even when it is able to send fetch requests.
2. ISR membership ends up reflecting how healthy the leader is: if serving a fetch takes longer than replica.lag.time.max.ms, we mark followers out of sync instead of the leader relinquishing leadership.

What we are proposing in this KIP is to use the time at which a follower sends a fetch request as the basis for keeping it in or dropping it from the ISR, and to leave the leader-side disk read time out of that decision. A rough sketch of the idea is at the bottom of this mail, below the quoted text.

Thanks,
Harsha

On Mon, Feb 10, 2020 at 9:26 PM, Lucas Bradstreet <lu...@confluent.io> wrote:

> Hi Harsha,
>
> Is the problem you'd like addressed the following?
>
> Assume 3 replicas, L and F1 and F2.
>
> 1. F1 and F2 are alive and sending fetch requests to L.
> 2. L starts encountering disk issues, any requests being processed by the
> request handler threads become blocked.
> 3. L's zookeeper connection is still alive so it remains the leader for
> the partition.
> 4. Given that F1 and F2 have not successfully fetched, L shrinks the ISR
> to itself.
>
> While KIP-501 may help prevent a shrink in partitions where a replica
> fetch request has started processing, any fetch requests in the request
> queue will have no effect. Generally when these slow/failing disk issues
> occur, all of the request handler threads end up blocked and requests queue
> up in the request queue. For example, all of the request handler threads
> may end up stuck in
> KafkaApis.handleProduceRequest handling produce requests, at which point
> all of the replica fetcher fetch requests remain queued in the request
> queue. If this happens, there will be no tracked fetch requests to prevent
> a shrink.
>
> Solving this shrinking issue is tricky. It would be better if L resigns
> leadership when it enters a degraded state rather than avoiding a shrink.
> If L is no longer the leader in this situation, it will eventually become
> blocked fetching from the new leader and the new leader will shrink the
> ISR, kicking out L.
>
> Cheers,
>
> Lucas
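As promised above, here is a rough sketch of the check we have in mind (again with hypothetical names, not a patch against the real broker code): the leader also remembers when it last received a fetch request from the follower, and only shrinks the ISR when both that timestamp and the caught-up timestamp have exceeded replica.lag.time.max.ms, so leader-side disk read time no longer counts against the follower.

object IsrShrinkProposed {
  // Hypothetical sketch of the idea in this KIP, not an implementation for the real
  // Partition/Replica classes: keep the follower in the ISR as long as it keeps sending
  // fetch requests, independent of how long the leader takes to read from its own disk.
  case class FollowerState(lastCaughtUpTimeMs: Long, lastFetchRequestReceivedTimeMs: Long)

  def isOutOfSync(f: FollowerState, nowMs: Long, replicaLagTimeMaxMs: Long): Boolean =
    (nowMs - f.lastCaughtUpTimeMs > replicaLagTimeMaxMs) &&
      (nowMs - f.lastFetchRequestReceivedTimeMs > replicaLagTimeMaxMs)

  def main(args: Array[String]): Unit = {
    val replicaLagTimeMaxMs = 30000L
    // Same 35s leader-side stall as before, but the follower sent its last fetch request 1s ago.
    val follower = FollowerState(lastCaughtUpTimeMs = 0L, lastFetchRequestReceivedTimeMs = 34000L)
    println(isOutOfSync(follower, nowMs = 35000L, replicaLagTimeMaxMs)) // false -> stays in ISR
  }
}

In the 35-second stall from the earlier example, a follower that kept fetching would stay in the ISR instead of being kicked out by a leader that is itself unhealthy.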