I think Matthias' comment is that we can still record the metrics at the thread level, while having the WARN log entry include sufficient context information so that users can still easily narrow down the investigation scope.
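[Editor's note: the thread-level-metric-plus-contextual-WARN-log idea can be sketched as below. The message wording, field names, and helper method are illustrative assumptions, not the final text the KIP or PR settled on.]

```java
import java.util.logging.Logger;

public class SkipLogSketch {
    private static final Logger log = Logger.getLogger(SkipLogSketch.class.getName());

    // Builds a context-rich skip message: reason plus topic/partition/offset.
    // The exact wording and fields here are illustrative, not the KIP's output.
    static String skipMessage(String reason, String topic, int partition, long offset) {
        return String.format(
                "Skipping record due to %s. topic=[%s] partition=[%d] offset=[%d]",
                reason, topic, partition, offset);
    }

    public static void main(String[] args) {
        // The single thread-level skipped-records metric would be bumped
        // alongside this; only the WARN log entry is sketched here.
        log.warning(skipMessage("negative timestamp", "input-topic", 2, 12345L));
    }
}
```

With this shape, the metric stays cheap and coarse while the log line alone carries enough context to locate the offending partition and offset.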
Guozhang

On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io> wrote:

I agree we should add as much information as is reasonable to the log. For example, see this WIP PR I started for this KIP:

https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80

and

https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111

I'm not sure if we should nail down the log messages in the KIP or in the PR discussion. What say you?

Thanks,
-John

On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matth...@confluent.io> wrote:

Thanks for sharing your thoughts. As I mentioned originally, I am not sure about the right log level either. Your arguments are convincing -- thus, I am fine with keeping the WARN level.

The task- vs. thread-level argument is an interesting one. However, I am wondering if we should add this information to the corresponding WARN logs that we write anyway? For this case, we can also log the corresponding operator (and other information like the topic name, etc., if needed). WDYT about this?

-Matthias

On 4/2/18 8:31 PM, Guozhang Wang wrote:

Regarding logging: I'm inclined to keep logging at WARN level, since skipped records are not expected in normal execution (for all the reasons that we are aware of), and hence when an error happens users should be alerted by the metrics and look into the log files. So to me, if it is really spamming the log files, that is also a good alert for users. Besides, for deserialization errors we already log at WARN level for this reason.

Regarding the metrics levels: I was pondering on that as well.
What made me think about and agree on task-level rather than thread-level is that for some reasons, like window retention, skips may be happening on only a subset of the input partitions, and since tasks are correlated with partitions, the task-level metrics can help users narrow down the specific input data partitions.

Guozhang

On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io> wrote:

Hi Matthias,

No worries! Thanks for the reply.

1) There isn't a connection. I tried using the TopologyTestDriver to write a quick test exercising the current behavior and discovered that the metrics weren't available. It seemed like they should be, so I tacked it on to this KIP. If you feel it's inappropriate, I can pull it back out.

2) I was also concerned about that, but I figured it would come up in discussion if I just went ahead and proposed it. And here we are!

Here's my thought: maybe there are two classes of skips, "controlled" and "uncontrolled", where "controlled" means that, as an app author, I deliberately filter out some events, and "uncontrolled" means that I simply don't account for some feature of the data, and the framework skips those records (as opposed to crashing).

In this breakdown, the skips I'm adding metrics for are all uncontrolled skips (and we hope to measure all the uncontrolled skips). Our skips are well documented, so it wouldn't be terrible to have an application in which you know you expect to have tons of uncontrolled skips, but it's not great either, since you may also have some *unexpected* uncontrolled skips. Those will be difficult to notice, since you're probably not alerting on the metric and are filtering out the logs (whatever their level).
I'd recommend that any app author, as an alternative, convert all expected skips to controlled ones, by updating the topology to filter those records out.

Following from my recommendation, as a library author, I'm inclined to mark those logs WARN, since in my opinion they should be concerning to the app authors. I'd definitely want to show, rather than hide, them by default, so I would pick INFO at least.

That said, logging is always a tricky issue for lower-level libraries that run inside user code, since we don't have all the information we need to make the right call.

On your last note, yeah, I got that impression from Guozhang as well. Thanks for the clarification.

-John

On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:

John,

sorry for my late reply, and thanks for updating the KIP.

I like your approach of "metrics are for monitoring, logs are for debugging" -- however:

1) I don't see a connection between this and the task-level metrics that you propose in order to get the metrics in `TopologyTestDriver`. I don't think people would monitor the `TopologyTestDriver`, and thus I am wondering why it is important to include the metrics there? Thread-level metrics might be easier to monitor, though (i.e., fewer different metrics to monitor).

2) I am a little worried about WARN-level logging and that it might be too chatty -- as you pointed out, it's about debugging, thus DEBUG level might be better. Not 100% sure about this, to be honest. What is the general assumption about the frequency of skipped records?
I could imagine cases in which skipped records are quite frequent, and thus WARN-level logs might "flood" the logs.

One final remark:

> More generally, I would like to establish a pattern in which we could add new values for the "reason" tags without needing a KIP to do so.

From my understanding, this is not feasible. Changing metrics is always considered a public API change, and we need a KIP for any change. As we moved away from tagging, it doesn't matter for the KIP anymore -- I just wanted to point it out.

-Matthias

On 3/30/18 2:47 PM, John Roesler wrote:

Alrighty! The KIP is updated.

Thanks again, all, for the feedback.
-John

On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io> wrote:

Hey Guozhang and Bill,

Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like to put it in the task instead of the thread, so that it'll show up in the TopologyTestDriver metrics as well.

I'm leaning toward keeping the scope where it is right now, but if others want to advocate for tossing in some more metrics, we can go that route.

Thanks all,
-John

On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com> wrote:

Thanks for the KIP, John, and sorry for the late comments.

I'm on the fence about providing single-level metrics, but I think we'll have that discussion outside of this KIP.
> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record

I'm +1 on both of these suggestions.

Finally, we have had requests in the past for some metrics around when a persistent store removes an expired window. Would adding that to our metrics stretch the scope of this KIP too much?

Thanks again, and overall I'm +1 on this KIP.

Bill

On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:

The proposal sounds good to me. As for "maintain only one level of metrics", maybe we can discuss that separately from this KIP, since that would be a larger scope of discussion. I agree that if we are going to maintain only one-level metrics it should be the lowest level, and we would let users do the roll-ups themselves, but I'm still not fully convinced that we should just provide single-level metrics, because 1) I think for different metrics people may be interested in investigating different granularities, e.g. poll / commit rate are at the lowest task-level metrics, while process-rate / skip-rate can be as low as processor-node metrics, and 2) user-side roll-ups may not be very straightforward.
But for 2), if someone can provide an efficient and easy implementation of that, I can be persuaded :)

For now I'm thinking we can add the metric at the thread level; either finer-grained ones with a "reason" tag plus an aggregated one without the tag, or just a single aggregated metric without the tag, looks good to me.

Guozhang

On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:

Hey Guozhang,

Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list in the KIP. I was also thinking we'd better start a documentation page with the metrics listed.

I'd have no problem logging a warning when we skip records. On the metric front, really I'm just pushing for us to maintain only one level of metrics. If that's more or less granular (i.e., maybe we don't have a metric per reason and log the reason instead), that's fine by me. I just don't think it provides a lot of extra value per complexity (interface and implementation) to maintain roll-ups at the thread level in addition to lower-level metrics.

How about this instead:
* maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
* introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record

If you like that, I can update the KIP.
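[Editor's note: the "user-side roll-up" that Guozhang calls not very straightforward is, in the simplest case, just a sum over the fine-grained counts. A minimal sketch, assuming the per-task skipped-record values have already been fetched from the metrics registry and keyed by an illustrative task-id scheme:]

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SkipRollup {
    // Sums fine-grained skipped-record counts (keyed here by an illustrative
    // task id) into the single client-level figure a monitor would watch.
    static long totalSkipped(Map<String, Long> perTaskSkips) {
        return perTaskSkips.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Long> skips = new LinkedHashMap<>();
        skips.put("0_0", 3L);
        skips.put("0_1", 0L);
        skips.put("1_0", 7L);
        // prints: client-level skipped-records = 10
        System.out.println("client-level skipped-records = " + totalSkipped(skips));
    }
}
```

The harder part, which the sum glosses over, is re-running this against the registry on every scrape as tasks migrate between threads, which is presumably why doing it inside the library was attractive.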
Thanks,
-John

On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?

For the commonly used JMX reporter, tags will be encoded directly as part of the object name, and if users want to monitor them they need to know these values beforehand. That is also why I think we do want to list all the possible values of the "reason" tags in the KIP.

> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.

That is a good point.
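[Editor's note: Guozhang's point about the JMX reporter is that metric tags become key properties of the MBean `ObjectName`, so a monitoring rule must know every tag value (e.g. every possible "reason") up front to subscribe to it. A small sketch using only `javax.management`; the domain and keys shown are illustrative, not the exact names Kafka Streams registers:]

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class JmxTagDemo {
    // Extracts the "reason" key property from a JMX object name. Because the
    // tag is baked into the name, a dashboard must enumerate every possible
    // reason value in advance in order to watch all of them.
    static String reasonTag(String name) {
        try {
            return new ObjectName(name).getKeyProperty("reason");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints: deserialization-error
        System.out.println(reasonTag(
                "kafka.streams:type=stream-task-metrics,task-id=0_1,reason=deserialization-error"));
    }
}
```

This is the concrete reason an undocumented "reason" value would silently escape monitoring, which supports listing the values in the KIP.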
I think we can either expose all levels of metrics by default, or only expose the lowest-level metrics and get rid of the other levels to let users do the roll-ups themselves (which will be a much larger scope for discussion), or we can encourage users not to depend purely on metrics for such troubleshooting: that is to say, users would only be alerted based on metrics, and we would log an info / warn log4j entry each time we are about to skip a record, in all the relevant places, so that upon being notified users can look into the logs to find the details on where / when it happens. WDYT?

Guozhang

On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io> wrote:

Hey Guozhang,

Thanks for the review.

1.
Matthias raised the same question about the "reason" tag values. I can list all possible values of the "reason" tag, but I'm thinking this level of detail may not be KIP-worthy; maybe the code and documentation review would be sufficient. If you all disagree and would like it included in the KIP, I can certainly do that.

If we do provide roll-up metrics, I agree with the pattern of keeping the same name but eliminating the tags for the dimensions that were rolled up.

2.
I'm not too sure that implementation efficiency really becomes a factor in choosing whether to (by default) update one coarse metric at the thread level or one granular metric at the processor-node level, since it's just one metric being updated either way. I do agree that if we were to update the granular metrics and multiple roll-ups, then we should consider the efficiency.

I agree it's probably not necessary to surface the metrics for all nodes regardless of whether they can or do skip records. Perhaps we can lazily register the metrics.

In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.

One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
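[Editor's note: the tag-agnostic alerting John describes — one rule over all skipped-records metrics, whatever their tags — can be sketched as a simple predicate over scraped metric values. The metric names are illustrative:]

```java
import java.util.Map;

public class SkipAlert {
    // One alerting rule covering every skipped-records metric regardless of
    // its "reason" tag: fire if any of them is non-zero.
    static boolean shouldAlert(Map<String, Double> metrics) {
        return metrics.entrySet().stream()
                .filter(e -> e.getKey().contains("skipped-records"))
                .anyMatch(e -> e.getValue() > 0.0);
    }

    public static void main(String[] args) {
        // prints: true
        System.out.println(shouldAlert(Map.of(
                "skipped-records-total", 5.0,
                "process-total", 100.0)));
    }
}
```

A name-substring match like this is what makes per-tag metrics no harder to alert on than a pre-aggregated one, which is John's point.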
Thanks again,
-John

On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote:

Hello John,

Thanks for the KIP. Some comments:

1. Could you list all the possible values of the "reason" tag? In the JIRA ticket I left some potential reasons, but I'm not clear whether you're going to categorize each of them as a separate reason, or whether there are any additional ones you have in mind.

Also, I'm wondering if we should add another metric that does not have the reason tag but aggregates over all possible reasons? This is for users to easily set up their alerting notifications (otherwise they have to write one notification rule per reason) in their monitoring systems.

2. Note that the processor-node metrics are actually "per-thread, per-task, per-processor-node", and today we only set the per-thread metrics as INFO while leaving the lower two layers as DEBUG. I agree with your argument that we are missing the per-client roll-up metrics today, but I'm not convinced that the right way to approach it would be "just providing the lowest-level metrics only".
Note that the recording implementations of these three levels are different internally today: we do not just roll up the lower-level metrics to generate the higher-level ones; we record them separately, which means that if we turn on multiple levels of metrics, we may duplicate the collection of some metrics. One can argue that this is not the best way to represent multi-level metrics collecting and reporting, but by only enabling thread-level metrics as INFO today, that implementation can be more efficient than only collecting the metrics at the lowest level and then doing the roll-up calculations outside of the metrics classes.

Plus, today not all processor nodes can possibly skip records; AFAIK we will only skip records at the source, sink, window, and aggregation processor nodes, so adding a metric per processor looks like overkill to me as well. On the other hand, from the user's perspective the "reason" tag may be sufficient for them to narrow down where inside the topology records are being dropped on the floor.
So I think the "per-thread, per-task" level metrics should be sufficient for them to troubleshoot in DEBUG mode, and we can add another "per-thread" level metric as INFO which is turned on by default. So under normal execution users still only need INFO-level metrics for alerting (e.g., set alerts on all skipped-records metrics being non-zero), and then upon troubleshooting they can turn on DEBUG metrics to look into which task is actually causing the skipped records.

Guozhang

On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:

Thanks for the KIP, John.

Reading the material on the related Jiras, I am wondering what `reason` tags you want to introduce? Can you elaborate? The KIP should list those, IMHO.

About the fine-grained metrics vs. the roll-up: you say that

> the coarse metric aggregates across two dimensions simultaneously

Can you elaborate on why this is an issue? I am not convinced atm that we should put the fine-grained metrics at INFO level and remove the roll-up at the thread level.

> Given that they have to do this sum to get a usable top-level view

This is a fair concern, but I don't share the conclusion.
Offering a built-in `KafkaStreams` "client" roll-up out of the box might be a better solution. In the past we did not offer this due to performance concerns, but we could allow an "opt-in" mechanism. If you disagree, can you provide some reasoning and add it to the "Rejected alternatives" section?

To rephrase: I understand the issue about the missing top-level view, but instead of going more fine-grained, we should consider adding this top-level view and adding/keeping the fine-grained metrics at DEBUG level only.

I am +1 to add TopologyTestDriver#metrics() and to remove the old metrics directly, as you suggested.

-Matthias

On 3/28/18 6:42 PM, Ted Yu wrote:

Looks good to me.

On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:

Hello all,

I am proposing KIP-274 to improve the metrics around skipped records in Streams.

Please find the details here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics

Please let me know what you think!
Thanks,
-John

--
-- Guozhang