Hi Dong,

I think AtMinIsr is still valuable as an indicator that the cluster is in a critical state and that something needs to be done ASAP to restore it. To your example:

> let's say min_isr = 1 and replica_set_size = 3, it is
> still possible that planned maintenance (e.g. one broker restart +
> partition reassignment) can cause isr size drop to 1. Since AtMinIsr can
> also cause fault positive (i.e. the fact that AtMinIsr > 0 does not
> necessarily need attention from user),

One broker restart shouldn't cause the ISR to drop from 3 to 1 unless 2 partitions are co-located on the same broker. Even then, this is still a valuable indicator to the admins that the partition assignment needs to be moved. In our case, we run 4 replicas for critical topics with min.isr = 2. URPs are not really a good indicator for immediate action when only one of the replicas is down. If 2 replicas are down and we are left with 2 live replicas, that is a stop-everything situation until the cluster is restored to a good state.

Thanks,
Harsha

On Wed, Feb 27, 2019, at 11:17 PM, Dong Lin wrote:
> Hey Kevin,
>
> Thanks for the update.
>
> The KIP suggests that AtMinIsr is better than UnderReplicatedPartition as
> indicator for alerting. However, in most case where min_isr =
> replica_set_size - 1, these two metrics are exactly the same, where planned
> maintenance can easily cause positive AtMinIsr value. In the other
> scenario, for example let's say min_isr = 1 and replica_set_size = 3, it is
> still possible that planned maintenance (e.g. one broker restart +
> partition reassignment) can cause isr size drop to 1. Since AtMinIsr can
> also cause fault positive (i.e. the fact that AtMinIsr > 0 does not
> necessarily need attention from user), I am not sure it is worth to add
> this metric.
>
> In the Usage section, it is mentioned that user needs to manually check
> whether there is ongoing maintenance after AtMinIsr is triggered. Could you
> explain how is this different from the current way where we use
> UnderReplicatedPartition to trigger alert? More specifically, can we just
> replace AtMinIsr with UnderReplicatedPartition in the Usage section?
>
> Thanks,
> Dong
>
>
> On Tue, Feb 26, 2019 at 6:49 PM Kevin Lu <lu.ke...@berkeley.edu> wrote:
>
> > Hi Dong!
> >
> > Thanks for the feedback!
> >
> > You bring up a good point in that the AtMinIsr metric cannot be used to
> > identify failure in the mentioned scenarios. I admit the motivation section
> > placed too much emphasis on "identifying failure".
> >
> > I have modified the KIP to reflect the implementation as the AtMinIsr
> > metric is intended to serve as a warning as one more failure to a partition
> > AtMinIsr will cause producers with acks=ALL configured to fail. It has an
> > additional benefit when minIsr=1 as it will warn us that the entire
> > partition is at risk of going offline, but that is more of a side effect
> > that only applies in that scenario (minIsr=1).
> >
> > Regards,
> > Kevin
> >
> > On Tue, Feb 26, 2019 at 5:11 PM Dong Lin <lindon...@gmail.com> wrote:
> >
> > > Hey Kevin,
> > >
> > > Thanks for the proposal!
> > >
> > > It seems that the proposed implementation does not match the motivation.
> > > The motivation suggests that the operator wants to tell the planned
> > > maintenance (e.g. broker restart) from unplanned failure (e.g. network
> > > failure). But the use of the metric AtMinIsr does not really differentiate
> > > between these causes of the reduced number of ISR. For example, an
> > > unplanned failure can cause ISR to drop from 3 to 2 but it can still be
> > > higher than the minIsr (say 1). And a planned maintenance can cause ISR to
> > > drop from 3 to 2, which trigger the AtMinIsr metric if minIsr=2. Can you
> > > update the design doc to fix or explain this issue?
> > >
> > > Thanks,
> > > Dong
> > >
> > > On Tue, Feb 12, 2019 at 9:02 AM Kevin Lu <lu.ke...@berkeley.edu> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Getting the discussion thread started for KIP-427 in case anyone is free
> > > > right now.
> > > >
> > > > I’d like to propose a new category of topic partitions *AtMinIsr* which are
> > > > partitions that only have the minimum number of in sync replicas left in
> > > > the ISR set (as configured by min.insync.replicas).
> > > >
> > > > This would add two new metrics *ReplicaManager.AtMinIsrPartitionCount* &
> > > > *Partition.AtMinIsr*, and a new TopicCommand option
> > > > *--at-min-isr-partitions* to help in monitoring and alerting.
> > > >
> > > > KIP link: KIP-427: Add AtMinIsr topic partition category (new metric &
> > > > TopicCommand option)
> > > > <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103089398>
> > > >
> > > > Please take a look and let me know what you think.
> > > >
> > > > Regards,
> > > > Kevin
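
For anyone skimming the thread, the two conditions being compared can be sketched roughly as below. This is only an illustration, not Kafka's implementation: it assumes AtMinIsr means "ISR size equals min.insync.replicas", as in the original proposal, and that a partition is under-replicated when its ISR is smaller than its replica set. The `PartitionState` class and the function names are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition (not a Kafka type)."""
    replicas: int  # size of the assigned replica set
    isr: int       # current number of in-sync replicas
    min_isr: int   # min.insync.replicas for the topic


def is_under_replicated(p: PartitionState) -> bool:
    # Assumed URP condition: at least one assigned replica is out of sync.
    return p.isr < p.replicas


def is_at_min_isr(p: PartitionState) -> bool:
    # Assumed AtMinIsr condition: one more lost replica would fail acks=all producers.
    return p.isr == p.min_isr


# Scenarios mentioned in the thread.
scenarios = {
    "4 replicas, min.isr=2, one down": PartitionState(replicas=4, isr=3, min_isr=2),
    "4 replicas, min.isr=2, two down": PartitionState(replicas=4, isr=2, min_isr=2),
    "3 replicas, min_isr=2, one down": PartitionState(replicas=3, isr=2, min_isr=2),
    "3 replicas, min_isr=1, one down": PartitionState(replicas=3, isr=2, min_isr=1),
}

for name, state in scenarios.items():
    print(f"{name}: URP={is_under_replicated(state)} AtMinIsr={is_at_min_isr(state)}")
```

With 4 replicas and min.isr = 2, a single lost replica trips only the under-replicated check, while the second loss trips both. With min_isr = replica_set_size - 1, the first loss already trips both, which is the overlap Dong describes.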