It seems to me that what this is driving at is listing partitions that
have no fault tolerance left. Maybe the wording could reflect that? That
rewording would have the benefit of covering 1 of 3 in sync, or 2 of 3
in sync with min.insync.replicas=2.
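Purely as a sketch (the class and method names below are mine, not
anything in TopicCommand), that condition boils down to the ISR having
shrunk to exactly min.insync.replicas:

public class FaultToleranceCheck {
    // A partition has no fault tolerance left when losing one more in-sync
    // replica would either drop it below min.insync.replicas (acks=all
    // produces start failing) or take the partition offline entirely.
    static boolean noFaultToleranceLeft(int isrSize, int minInsyncReplicas) {
        return isrSize == minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(noFaultToleranceLeft(1, 1)); // 1 of 3 in sync, default min.insync.replicas=1 -> true
        System.out.println(noFaultToleranceLeft(2, 2)); // 2 of 3 in sync, min.insync.replicas=2 -> true
        System.out.println(noFaultToleranceLeft(3, 2)); // fully in sync -> false, one replica of headroom left
    }
}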
I'm not convinced by your labelling of WARNING / CRITICAL. It seems to
make assumptions on behalf of the topic owners. For example: if there's
a "two rack" cluster (say two DCs) and keeping written data is
absolutely critical, it's advisable to configure rf=4,
min.insync.replicas=3. Having only 3 out of 4 in sync would mean no more
HA. Depending on the ops culture there, that may be considered a warning
or critical. I would call it a warning. If topic unavailability must be
avoided, I'd call it a warning requiring immediate attention. If it
dropped to 2/4, with acks=all that would then be critical/fatal. But
according to your KIP you'd probably label that a warning. I'd suggest
considering wording that doesn't introduce ambiguity or assume
interpretations of what should be a warning or more severe.

On Sat, Aug 25, 2018 at 3:46 AM Kevin Lu <lu.ke...@berkeley.edu> wrote:

> Hi All,
>
> I am having some trouble re-formulating this KIP to output partitions
> that are under the configured "min.insync.replicas", as I am not sure
> how to reliably get the configured "min.insync.replicas" in all cases.
>
> The challenge I am facing is when "min.insync.replicas" is configured
> to a non-default value on the broker, and topics are created without
> "min.insync.replicas" specified. Since the topic is created without
> specifying "min.insync.replicas", the value is not saved in ZooKeeper
> and is used directly by the brokers.
>
> The TopicCommand hits ZooKeeper, so I am unable to get the configured
> value without querying the brokers somehow...
>
> Example:
> - Start a broker with min.insync.replicas=2 in server.properties
> - Use kafka-topics.sh to create a topic without specifying
>   min.insync.replicas
>
> The ZK node /config/ for the created topic will only have direct
> overrides, and will not have the broker's configured
> min.insync.replicas.
>
> Any ideas on how to approach this?
>
> Regards,
> Kevin
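For what it's worth, one possible direction (just a sketch, not
something TopicCommand does today; the bootstrap address and topic name
below are placeholders) would be to ask a broker for the effective topic
config via the AdminClient describeConfigs API. The broker resolves its
own static min.insync.replicas default server-side even when the topic
has no override in ZooKeeper:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collections;
import java.util.Properties;

public class EffectiveMinIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Describe the topic's config; the broker fills in values inherited
            // from its own static config, not just the per-topic overrides in ZK.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder
            Config config = admin.describeConfigs(Collections.singleton(topic))
                    .all().get().get(topic);
            System.out.println("effective min.insync.replicas = "
                    + config.get("min.insync.replicas").value());
        }
    }
}

That would of course mean TopicCommand talking to the brokers rather
than only to ZooKeeper, which may be a bigger change than the KIP
intends.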
>
> On Mon, Aug 6, 2018 at 8:21 AM Kevin Lu <lu.ke...@berkeley.edu> wrote:
>
> > Hi Jason,
> >
> > Thanks for the response!
> >
> > I completely agree with you and Mickael about adding a
> > --under-minisr-partitions option to match the existing metric. I will
> > create a separate KIP to discuss the --under-minisr-partitions option.
> > I believe there is a technical challenge with retrieving the
> > min.insync.replicas configuration from ZooKeeper currently, as it may
> > also be stored as a broker configuration, but let me do some digging
> > to confirm.
> >
> > I am going to modify KIP-351 to represent the gap that you have
> > mentioned (exactly at min.isr), as this is an important state that we
> > specifically monitor and alert on.
> >
> > Any other thoughts?
> >
> > Regards,
> > Kevin
> >
> > On Thu, Aug 2, 2018 at 11:23 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> >> Hey Kevin,
> >>
> >> Thanks for the KIP. I like Mickael's suggestion to add
> >> --under-minisr-partitions since it fits with the metric we already
> >> expose. It's also a good question whether there should be a separate
> >> category for partitions which are right at min.isr. I'm reluctant to
> >> add new categories, but I agree there might be a gap at the moment.
> >> Say you have a replication factor of 3 and the min ISR is set to 1.
> >> Our notion of URP does not capture the difference between an ISR
> >> that is down to a size of 1 and one that is down to a size of 2. The
> >> reason this might be significant is that a shrink of the ISR down to
> >> 2 may just be caused by a rolling restart or a transient network
> >> blip. A shrink to 1, on the other hand, might be indicative of a
> >> more severe problem and could be cause for a call from PagerDuty.
> >>
> >> -Jason
> >>
> >> On Thu, Aug 2, 2018 at 9:28 AM, Kevin Lu <lu.ke...@berkeley.edu> wrote:
> >>
> >> > Hi Mickael,
> >> >
> >> > Thanks for the suggestion!
> >> >
> >> > Correct me if I am mistaken, but if a producer attempts to send to
> >> > a partition that is under min ISR (and acks=all or -1), then the
> >> > send will fail with a NotEnoughReplicas or
> >> > NotEnoughReplicasAfterAppend exception? At this point, the client
> >> > side has already suffered a failure but the server side is still
> >> > fine for now?
> >> >
> >> > If the above is true, then this would be a FATAL case for producers.
> >> >
> >> > Would it be valuable to include the CRITICAL case where a topic
> >> > partition has exactly min ISR, so that Kafka operators can take
> >> > action before it becomes FATAL? This could be in the same option
> >> > or a new one.
> >> >
> >> > Thanks!
> >> >
> >> > Regards,
> >> > Kevin
> >> >
> >> > On Thu, Aug 2, 2018 at 2:27 AM Mickael Maison
> >> > <mickael.mai...@gmail.com> wrote:
> >> >
> >> > > What about also adding a --under-minisr-partitions option?
> >> > >
> >> > > That would match the
> >> > > "kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount"
> >> > > broker metric, and it's usually pretty relevant when
> >> > > investigating issues.
> >> > >
> >> > > On Thu, Aug 2, 2018 at 8:54 AM, Kevin Lu <lu.ke...@berkeley.edu>
> >> > > wrote:
> >> > > > Hi friends!
> >> > > >
> >> > > > This thread is to discuss KIP-351
> >> > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-351+Add+--critical-partitions+option+to+describe+topics+command>!
> >> > > >
> >> > > > I am proposing to add a --critical-partitions option to the
> >> > > > describe topics command that will only list out topic
> >> > > > partitions that have 1 ISR left (RF > 1), as they would be in
> >> > > > a critical state and need immediate repartitioning.
> >> > > >
> >> > > > I wonder if the name "critical" is appropriate?
> >> > > >
> >> > > > Thoughts?
> >> > > >
> >> > > > Thanks!
> >> > > >
> >> > > > Regards,
> >> > > > Kevin
> >> > >
> >> >
> >>
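For what it's worth, the producer-side behaviour Kevin asks about above
is easy to see with a small sketch (only an illustration, not part of
the KIP; the bootstrap address and topic name are placeholders): with
acks=all, a send to a partition whose ISR has fallen below
min.insync.replicas fails with NotEnoughReplicas or
NotEnoughReplicasAfterAppend, even though the partition still has a
leader and can be read.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class MinIsrProduceCheck {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // require the full min ISR to acknowledge
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value")).get(); // placeholder topic
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof NotEnoughReplicasException
                    || cause instanceof NotEnoughReplicasAfterAppendException) {
                // The ISR has dropped below min.insync.replicas: the broker rejects
                // the write even though the partition still has a leader.
                System.err.println("Partition is under min ISR: " + cause.getMessage());
            } else {
                throw new RuntimeException(cause);
            }
        }
    }
}

That under-min-ISR state is what an --under-minisr-partitions style
option would surface, and the exactly-at-min-ISR case is the last
warning before reaching it.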