Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

Satish Duggana Tue, 21 Jan 2020 19:28:57 -0800

Hi Jun,
Can you please review the KIP and let us know your comments?

If there are no comments/questions, we can start a vote thread.


It looks like Yelp folks also encountered the same issue as mentioned
in JIRA comment[1].

>> Flavien Raynaud added a comment - Yesterday
We've seen offline partitions happening for the same reason in one of
our clusters too, where only the broker leader for the offline
partitions was having disk issues. It looks like there has not been
much progress/look on the PR submitted since December 9th. Is there
anything blocking this change from moving forward?

[1] https://issues.apache.org/jira/browse/KAFKA-8733?focusedCommentId=17020083&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17020083

Thanks,
Satish.


On Thu, Dec 5, 2019 at 10:38 AM Harsha Chintalapani <[email protected]> wrote:
>
> Hi Jason,
>          As Satish said, just increasing the replica max lag will not work in
> this case. Just before a disk dies, reads become really slow, and it's hard
> to estimate by how much; the range we noticed is pretty wide. Overall, it
> doesn't make sense to knock good replicas out of the ISR just because a
> leader is slow in processing reads or serving fetch requests, which may be
> due to disk issues in this case but could be other issues as well. I think
> this KIP addresses all of these issues in general.
>          Do you still have questions on the current approach? If not, we can
> take it to a vote.
> Thanks,
> Harsha
>
>
> On Mon, Nov 18, 2019 at 7:05 PM, Satish Duggana <[email protected]>
> wrote:
>
> > Hi Jason,
> > Thanks for looking into the KIP. Apologies for my late reply. Increasing
> > replica max lag to 30-45 secs did not help, as we observed that a few fetch
> > requests took more than 1-2 minutes. We do not want to increase it further
> > as that raises the upper bound on commit latency. We have strict SLAs on
> > some of the clusters for end-to-end (producer to consumer) latency. This
> > proposal improves the availability of partitions when followers are trying
> > their best to stay in sync even when leaders are slow in processing those
> > requests. I have updated the KIP to have a single config for backward
> > compatibility, and I think this config is more comprehensible than the
> > earlier ones. But I believe there is no need for a config at all, because
> > the proposal in the KIP is an enhancement to the existing behavior. Please
> > let me know your comments.
> >
> > Thanks,
> > Satish.
> >
> > On Thu, Nov 14, 2019 at 10:57 AM Jason Gustafson <[email protected]>
> > wrote:
> >
> > Hi Satish,
> >
> > Thanks for the KIP. I'm wondering how much of this problem can be
> > addressed just by increasing the replication max lag? That was one of the
> > purposes of KIP-537 (the default increased from 10s to 30s). Also, the new
> > configurations seem quite low level. I think they will be hard for users to
> > understand (even reading through a couple times I'm not sure I understand
> > them fully). I think if there's a way to improve this behavior without
> > requiring any new configurations, it would be much more attractive.
> >
> > Best,
> > Jason
> >
> > On Wed, Nov 6, 2019 at 8:14 AM Satish Duggana <[email protected]>
> > wrote:
> >
> > Hi Dhruvil,
> > Thanks for looking into the KIP.
> >
> > 10. I have an initial sketch of KIP-501 in commit [a], which covers
> > tracking the pending fetch requests. Tracking is not done in
> > Partition#readRecords because if reading any one of the partitions takes
> > longer, we do not want any of the replicas in this fetch request to go out
> > of sync.
> >
> > 11. I think the `Replica` class should be thread-safe to handle the remote
> > scenario of concurrent requests running for a follower replica. Or I may be
> > missing something here. This is a separate issue from KIP-501. I will file
> > a separate JIRA to discuss that issue.
> >
> > [a] https://github.com/satishd/kafka/commit/c69b525abe8f6aad5059236076a003cdec4c4eb7
> >
> > Thanks,
> > Satish.
> >
> > On Tue, Oct 29, 2019 at 10:57 AM Dhruvil Shah <[email protected]>
> > wrote:
> >
> > Hi Satish,
> >
> > Thanks for the KIP, this seems very useful. Could you elaborate on how
> > pending fetch requests are tracked?
> >
> > Thanks,
> > Dhruvil
> >
> > On Mon, Oct 28, 2019 at 9:43 PM Satish Duggana <[email protected]>
> > wrote:
> >
> > Hi All,
> > I wrote a short KIP about avoiding out-of-sync or offline partitions when
> > follower fetch requests are not processed in time by the leader replica.
> > KIP-501 is located at https://s.apache.org/jhbpn
> >
> > Please take a look, I would like to hear your feedback and suggestions.
> >
> > JIRA: https://issues.apache.org/jira/browse/KAFKA-8733
> >
> > Thanks,
> > Satish.
> >
> >
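
The idea running through this thread is that a follower should not be dropped
from the ISR while its fetch request is still waiting on a slow leader. It can
be sketched roughly as below. This is a minimal illustration only, not the code
in the KIP or in the linked commit: the class `PendingFetchTracker` and all of
its method names are hypothetical, and the actual proposal applies more
conditions than this simplified check.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical leader-side bookkeeping for one partition (names are assumptions).
final class PendingFetchTracker {
    private final long maxLagMs; // plays the role of the replica max lag setting
    private final Map<Integer, Long> lastCaughtUpMs = new ConcurrentHashMap<>();
    private final Map<Integer, Integer> pendingFetches = new ConcurrentHashMap<>();

    PendingFetchTracker(long maxLagMs) {
        this.maxLagMs = maxLagMs;
    }

    // Called when a follower's fetch request arrives, before any log read starts.
    void fetchReceived(int followerId) {
        pendingFetches.merge(followerId, 1, Integer::sum);
    }

    // Called when the leader has finished serving the fetch request.
    void fetchCompleted(int followerId, long nowMs, boolean caughtUp) {
        // Decrement the pending count; returning null removes the entry.
        pendingFetches.computeIfPresent(followerId,
            (id, n) -> n > 1 ? Integer.valueOf(n - 1) : null);
        if (caughtUp) {
            lastCaughtUpMs.put(followerId, nowMs);
        }
    }

    // A follower is out of sync only if it is lagging AND has no fetch in flight.
    boolean isOutOfSync(int followerId, long nowMs) {
        long last = lastCaughtUpMs.getOrDefault(followerId, 0L);
        boolean lagging = nowMs - last > maxLagMs;
        boolean hasPending = pendingFetches.containsKey(followerId);
        return lagging && !hasPending;
    }
}
```

With a check like this, a follower whose request has been stuck on a slow disk
read for longer than the lag limit stays in the ISR, without raising the lag
limit itself and therefore without raising the bound on commit latency.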
