John, sorry for my late reply and thanks for updating the KIP.
I like your approach about "metrics are for monitoring, logs are for debugging" -- however:

1) I don't see the connection between this and the task-level metrics that you propose in order to get the metrics into `TopologyTestDriver`. I don't think people would monitor the `TopologyTestDriver`, and thus I am wondering why it is important to include the metrics there? Thread-level metrics might be easier to monitor though (ie, fewer distinct metrics to monitor).

2) I am a little worried that WARN level logging might be too chatty -- as you pointed out, it's about debugging, thus DEBUG level might be better. Not 100% sure about this, to be honest. What is the general assumption about the frequency of skipped records? I could imagine cases for which skipped records are quite frequent, and thus WARN level logs might "flood" the logs.

One final remark:

> More generally, I would like to establish a pattern in which we could add new values for the "reason" tags without needing a KIP to do so.

From my understanding, this is not feasible. Changing metrics is always considered a public API change, and we need a KIP for any change. As we moved away from tagging, it doesn't matter for the KIP anymore -- just wanted to point it out.

-Matthias

On 3/30/18 2:47 PM, John Roesler wrote:
> Allrighty! The KIP is updated.
>
> Thanks again, all, for the feedback.
> -John
>
> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io> wrote:
>
>> Hey Guozhang and Bill,
>>
>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like to put it in the task instead of the thread so that it'll show up in the TopologyTestDriver metrics as well.
>>
>> I'm leaning toward keeping the scope where it is right now, but if others want to advocate for tossing in some more metrics, we can go that route.
>>
>> Thanks all,
>> -John
>>
>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com> wrote:
>>
>>> Thanks for the KIP John, and sorry for the late comments.
>>>
>>> I'm on the fence about providing single-level metrics, but I think we'll have that discussion outside of this KIP.
>>>
>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>
>>> I'm +1 on both of these suggestions.
>>>
>>> Finally, we have had requests in the past for some metrics around when a persistent store removes an expired window. Would adding that to our metrics stretch the scope of this KIP too much?
>>>
>>> Thanks again, and overall I'm +1 on this KIP.
>>>
>>> Bill
>>>
>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>
>>>> The proposal sounds good to me. As for "maintain only one level of metrics", maybe we can discuss that separately from this KIP, since it would be a larger scope of discussion. I agree that if we are going to maintain only one level of metrics it should be the lowest level, and we would let users do the roll-ups themselves, but I'm still not fully convinced that we should just provide single-level metrics, because 1) I think for different metrics people may be interested in investigating different granularities, e.g. poll / commit rate are at the lowest task-level metrics, while process-rate / skip-rate can be as low as processor-node metrics, and 2) user-side roll-ups may not be very straightforward. But for 2), if someone can provide an efficient and easy implementation of that, I can be persuaded :)
>>>>
>>>> For now I'm thinking we can add the metric at the thread level; either finer-grained ones with a "reason" tag plus an aggregated one without the tag, or just a single aggregated metric without the tag, looks good to me.
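Guozhang's point 2 above -- that user-side roll-ups may not be straightforward -- can be illustrated with a minimal sketch. This assumes metric values are available keyed by their tag maps; the tag names `task-id` and `reason` are placeholders for illustration, not the KIP's final names:

```java
import java.util.HashMap;
import java.util.Map;

public class SkipRollup {
    /**
     * Roll finer-grained skip metrics up along one tag dimension.
     * Input: metric values keyed by their tag maps
     * (e.g. {task-id=0_0, reason=deserialization-error} -> 3.0).
     * Output: totals keyed by the chosen tag's value, summed over all other tags.
     */
    public static Map<String, Double> rollUpBy(Map<Map<String, String>, Double> metrics, String tag) {
        Map<String, Double> totals = new HashMap<>();
        for (Map.Entry<Map<String, String>, Double> e : metrics.entrySet()) {
            totals.merge(e.getKey().getOrDefault(tag, "unknown"), e.getValue(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<Map<String, String>, Double> skips = Map.of(
            Map.of("task-id", "0_0", "reason", "deserialization-error"), 3.0,
            Map.of("task-id", "0_0", "reason", "negative-timestamp"), 1.0,
            Map.of("task-id", "0_1", "reason", "deserialization-error"), 2.0);
        System.out.println(rollUpBy(skips, "task-id"));  // per-task totals
        System.out.println(rollUpBy(skips, "reason"));   // per-reason totals
    }
}
```

The mechanics are simple per dimension; the friction Guozhang alludes to is more about users having to write and maintain such glue code at all.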
>>>>
>>>>
>>>> Guozhang
>>>>
>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:
>>>>
>>>>> Hey Guozhang,
>>>>>
>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list in the KIP. I was also thinking we'd better start a documentation page with the metrics listed.
>>>>>
>>>>> I'd have no problem logging a warning when we skip records. On the metric front, really I'm just pushing for us to maintain only one level of metrics. If that's more or less granular (i.e., maybe we don't have a metric per reason and log the reason instead), that's fine by me. I just don't think it provides a lot of extra value per complexity (interface and implementation) to maintain roll-ups at the thread level in addition to lower-level metrics.
>>>>>
>>>>> How about this instead:
>>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>>>
>>>>> If you like that, I can update the KIP.
>>>>>
>>>>> Thanks,
>>>>> -John
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>
>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>
>>>>>> For the commonly used JMX reporter, tags will be encoded directly as part of the object name, and if users want to monitor them they need to know these values beforehand.
>>>>>> That is also why I think we do want to list all the possible values of the reason tags in the KIP, since
>>>>>>
>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>
>>>>>> That is a good point. I think we can either expose all levels of metrics by default, or only expose the lowest-level metrics and get rid of the other levels to let users do roll-ups themselves (which would be a much larger scope for discussion), or we can encourage users not to depend purely on metrics for such troubleshooting: that is to say, users would only be alerted based on metrics, and we can log an info / warn log4j entry each time we are about to skip a record, in all the relevant places, so that upon being notified users can look into the logs to find the details on where / when it happens. WDYT?
>>>>>>
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>
>>>>>>> Hey Guozhang,
>>>>>>>
>>>>>>> Thanks for the review.
>>>>>>>
>>>>>>> 1.
>>>>>>> Matthias raised the same question about the "reason" tag values. I can list all possible values of the "reason" tag, but I'm thinking this level of detail may not be KIP-worthy; maybe the code and documentation review would be sufficient. If you all disagree and would like it included in the KIP, I can certainly do that.
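Guozhang's JMX point -- tags are encoded directly into the MBean object name, so a user must know every tag value up front to subscribe to a metric -- can be seen with a small stdlib sketch. The domain, type, and tag names below are assumptions in the style of Kafka's JMX naming, not the exact names from the KIP:

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class JmxTagDemo {
    // Hypothetical object-name builder: with the JMX reporter, each tag
    // becomes a key property of the ObjectName, so the full name is only
    // known once every tag value (including "reason") is known.
    public static ObjectName skipMetricName(String threadId, String taskId, String reason) {
        try {
            return new ObjectName(String.format(
                "kafka.streams:type=stream-task-metrics,thread-id=%s,task-id=%s,reason=%s",
                threadId, taskId, reason));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        ObjectName name = skipMetricName("StreamThread-1", "0_0", "deserialization-error");
        System.out.println(name);
        System.out.println("reason = " + name.getKeyProperty("reason"));
    }
}
```

Monitoring systems that understand ObjectName patterns (e.g. `reason=*,*`) can match over unknown tag values, but plain JMX subscriptions cannot, which is the motivation for listing the possible "reason" values in the KIP.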
>>>>>>>
>>>>>>> If we do provide roll-up metrics, I agree with the pattern of keeping the same name but eliminating the tags for the dimensions that were rolled up.
>>>>>>>
>>>>>>> 2.
>>>>>>> I'm not too sure that implementation efficiency really becomes a factor in choosing whether to (by default) update one coarse metric at the thread level or one granular metric at the processor-node level, since it's just one metric being updated either way. I do agree that if we were to update the granular metrics and multiple roll-ups, then we should consider the efficiency.
>>>>>>>
>>>>>>> I agree it's probably not necessary to surface the metrics for all nodes regardless of whether they can or do skip records. Perhaps we can lazily register the metrics.
>>>>>>>
>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>>
>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>>
>>>>>>> Thanks again,
>>>>>>> -John
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello John,
>>>>>>>>
>>>>>>>> Thanks for the KIP.
>>>>>>>> Some comments:
>>>>>>>>
>>>>>>>> 1. Could you list all the possible values of the "reason" tag? In the JIRA ticket I left some potential reasons, but I'm not clear whether you're going to categorize each of them as a separate reason, or whether there are additional ones you have in mind.
>>>>>>>>
>>>>>>>> Also, I'm wondering if we should add another metric that does not have the reason tag but aggregates among all possible reasons? This is for users to easily set their alerting notifications (otherwise they have to write one notification rule per reason) in their monitoring systems.
>>>>>>>>
>>>>>>>> 2. Note that the processor-node metrics are actually "per-thread, per-task, per-processor-node", and today we only set the per-thread metrics as INFO while leaving the lower two layers as DEBUG. I agree with your argument that we are missing the per-client roll-up metrics today, but I'm not convinced that the right way to approach it would be "just providing the lowest-level metrics only".
>>>>>>>>
>>>>>>>> Note that the recording implementation of these three levels is different internally today: we do not just roll up the lower-level metrics to generate the higher-level ones; we record them separately, which means that, if we turn on multiple levels of metrics, we may duplicate collecting some metrics.
>>>>>>>> One can argue that is not the best way to represent multi-level metrics collecting and reporting, but by only enabling thread-level metrics as INFO today, that implementation could be more efficient than collecting the metrics only at the lowest level and then doing the roll-up calculations outside of the metrics classes.
>>>>>>>>
>>>>>>>> Plus, today not all processor nodes may possibly skip records; AFAIK we will only skip records at the source, sink, window, and aggregation processor nodes, so adding a metric per processor looks like overkill to me as well. On the other hand, from the user's perspective the "reason" tag may be sufficient to narrow down where inside the topology records are being dropped on the floor. So I think the "per-thread, per-task" level metrics should be sufficient for troubleshooting in DEBUG mode, and we can add another "per-thread" level metric as INFO which is turned on by default. So under normal execution users still only need INFO level metrics for alerting (e.g. set alerts on all skipped-records metrics being non-zero), and then upon troubleshooting they can turn on DEBUG metrics to look into which task is actually causing the skipped records.
>>>>>>>>
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>
>>>>>>>>> Reading the material on the related Jiras, I am wondering what `reason` tags you want to introduce? Can you elaborate? The KIP should list those IMHO.
>>>>>>>>>
>>>>>>>>> About the fine-grained metrics vs. the roll-up: you say that
>>>>>>>>>
>>>>>>>>>> the coarse metric aggregates across two dimensions simultaneously
>>>>>>>>>
>>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm that we should put the fine-grained metrics into INFO level and remove the roll-up at thread level.
>>>>>>>>>
>>>>>>>>>> Given that they have to do this sum to get a usable top-level view
>>>>>>>>>
>>>>>>>>> This is a fair concern, but I don't share the conclusion. Offering a built-in `KafkaStreams` "client" roll-up out of the box might be a better solution. In the past we did not offer this due to performance concerns, but we could allow an "opt-in" mechanism. If you disagree, can you provide some reasoning and add them to the "Rejected alternatives" section?
>>>>>>>>>
>>>>>>>>> To rephrase: I understand the issue about the missing top-level view, but instead of going more fine-grained, we should consider adding this top-level view and adding/keeping the fine-grained metrics at DEBUG level only.
>>>>>>>>>
>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old metrics directly as you suggested.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>> Looks good to me.
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around skipped records in Streams.
>>>>>>>>>>>
>>>>>>>>>>> Please find the details here:
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>
>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -- Guozhang
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> -- Guozhang
>>>>>
>>>>
>>>> --
>>>> -- Guozhang
>>>
>>
>
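For illustration, the warn-level log discussed in this thread -- one line carrying topic, partition, offset, and reason, alongside a single skip metric with no "reason" tag -- might look like the following sketch. The helper and the message format are hypothetical, not the wording that the KIP or Kafka ultimately adopted:

```java
import java.util.logging.Logger;

public class SkipLogDemo {
    private static final Logger log = Logger.getLogger(SkipLogDemo.class.getName());

    // Hypothetical helper: one log line carrying everything needed to debug
    // a skipped record, so the metric itself can stay a single unlabeled count.
    public static String skipMessage(String topic, int partition, long offset, String reason) {
        return String.format(
            "Skipping record due to %s. topic=[%s] partition=[%d] offset=[%d]",
            reason, topic, partition, offset);
    }

    public static void main(String[] args) {
        // Alert on the metric; read the log for the details.
        log.warning(skipMessage("orders", 2, 1042L, "deserialization error"));
    }
}
```

This captures the division of labor agreed in the thread: the metric answers "are records being skipped?", while the log answers "where and why?".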