Hi John, Thanks for making the updates.
I agree with the information you've included in the logs as described above, as log statements without enough context/information can be frustrating.

-Bill

On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <j...@confluent.io> wrote:

Allrighty, how about this, then...

I'll move the metric back to the StreamThread and maintain the existing tag (client-id=...(per-thread client-id)). It won't be present in the TopologyTestDriver's metrics.

As a side note, I'm not sure that the location of the log messages has visibility into the name of the thread or the task, or the processor node, for that matter. But at the end of the day, I don't think it really matters.

None of those identifiers are in the public interface or user-controlled. For them to be useful for debugging, users would have to gain a very deep understanding of how their DSL program gets executed. From my perspective, they are all included in metric tags only to prevent collisions between the same metrics in different (e.g.) threads.

I think what's important is to provide the right information in the logs so that users will be able to debug their issues. This is why the logs in my PR include the topic/partition/offset of the offending data, as well as the stacktrace of the exception from the deserializer (or, for timestamps, the extracted timestamp and the class name of their extractor). This information alone should let them pinpoint the offending data and fix it.

(I am aware that the topic name might be a repartition topic, and therefore also esoteric from the user's perspective, but I think it's the best we can do right now. It might be nice to explicitly take on a debugging-ergonomics task in the future and give all processor nodes human-friendly names. Then, we could surface these names in any logs or exceptions. But I'm inclined to call this out-of-scope for now.)

Thanks again,
-John

On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wangg...@gmail.com> wrote:

1. If we can indeed gather all the context information from the log4j entries, I'd suggest we change to thread-level (I'm not sure if that is doable, so if John already has a WIP PR, that can help us decide).

2. We can consider adding the API in TopologyTestDriver for general testing purposes; that being said, I think Matthias has a good point that this alone should not be a driving motivation for us to keep this metric as task-level if 1) is true.

Guozhang

On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <matth...@confluent.io> wrote:

Thanks Guozhang, that was my intent.

@John: yes, we should not nail down the exact log message. It's just to point out the trade-off. If we can get the required information in the logs, we might not need task-level metrics.

-Matthias

On 4/3/18 11:26 AM, Guozhang Wang wrote:

I think Matthias' comment is that we can still record the metrics on the thread level, while having the WARN log entry include sufficient context information so that users can still easily narrow down the investigation scope.

Guozhang

On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io> wrote:

I agree we should add as much information as is reasonable to the log. For example, see this WIP PR I started for this KIP:

https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80

and

https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111

I'm not sure if we should nail down the log messages in the KIP or in the PR discussion. What say you?

Thanks,
-John
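To make the log discussion concrete, here is a rough sketch of the kind of warning in question; the logger, variable names, and exact wording are illustrative placeholders, not the actual code from the PR:

    // Hypothetical sketch only: 'log' is an slf4j Logger; rawRecord and
    // deserializationException are placeholders. The point is that the line
    // carries enough context (topic/partition/offset plus the cause) to
    // pinpoint the bad record.
    log.warn(
        "Skipping record due to deserialization error. topic=[{}] partition=[{}] offset=[{}]",
        rawRecord.topic(), rawRecord.partition(), rawRecord.offset(),
        deserializationException);

Passing the exception as the last argument lets slf4j append the stack trace, which matches the goal of logging the deserializer's exception alongside the record coordinates.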
On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matth...@confluent.io> wrote:

Thanks for sharing your thoughts. As I mentioned originally, I am not sure about the right log level either. Your arguments are convincing -- thus, I am fine with keeping WARN level.

The task vs. thread level argument is an interesting one. However, I am wondering if we should add this information into the corresponding WARN logs that we write anyway? For this case, we can also log the corresponding operator (and other information like topic name etc. if needed). WDYT about this?

-Matthias

On 4/2/18 8:31 PM, Guozhang Wang wrote:

Regarding logging: I'm inclined to keep logging at WARN level since skipped records are not expected in normal execution (for all reasons that we are aware of), and hence when an error happens users should be alerted by the metrics and look into the log files; so to me, if it is really spamming the log files, that is itself a good alert for users. Besides, for deserialization errors we already log at WARN level for this reason.

Regarding the metrics levels: I was pondering on that as well. What made me think about and agree on task-level rather than thread-level is that, for some reasons like window retention, skips may be happening on only a subset of input partitions; and since tasks are correlated with partitions, task-level metrics can help users narrow down the specific input data partitions.

Guozhang
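As background for the deserialization case mentioned above: those skips come from the deserialization exception handler that Streams already ships, which is what logs and continues today. A minimal config sketch, with the application id and bootstrap servers as placeholders:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

    // Log-and-skip records that fail deserialization instead of failing the app
    // (the handler already exists; "my-app" and the broker address are placeholders).
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
              LogAndContinueExceptionHandler.class);

Each record skipped this way is exactly the kind of event the proposed metric and WARN log are meant to surface.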
On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io> wrote:

Hi Matthias,

No worries! Thanks for the reply.

1) There isn't a connection. I tried using the TopologyTestDriver to write a quick test exercising the current behavior and discovered that the metrics weren't available. It seemed like they should be, so I tacked it on to this KIP. If you feel it's inappropriate, I can pull it back out.

2) I was also concerned about that, but I figured it would come up in discussion if I just went ahead and proposed it. And here we are!

Here's my thought: maybe there are two classes of skips: "controlled" and "uncontrolled", where "controlled" means, as an app author, I deliberately filter out some events, and "uncontrolled" means that I simply don't account for some feature of the data, and the framework skips them (as opposed to crashing).

In this breakdown, the skips I'm adding metrics for are all uncontrolled skips (and we hope to measure all the uncontrolled skips). Our skips are well documented, so it wouldn't be terrible to have an application in which you know you expect to have tons of uncontrolled skips, but it's not great either, since you may also have some *unexpected* uncontrolled skips. It'll be difficult to notice, since you're probably not alerting on the metric and are filtering out the logs (whatever their level).

I'd recommend that any app author, as an alternative, convert all expected skips to controlled ones, by updating the topology to filter those records out.

Following from my recommendation, as a library author, I'm inclined to mark those logs WARN, since in my opinion they should be concerning to the app authors. I'd definitely want to show, rather than hide, them by default, so I would pick INFO at least.

That said, logging is always a tricky issue for lower-level libraries that run inside user code, since we don't have all the information we need to make the right call.

On your last note, yeah, I got that impression from Guozhang as well. Thanks for the clarification.

-John
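To illustrate the "convert expected skips to controlled ones" recommendation, a minimal DSL sketch (the topic name is made up) that drops null-keyed or null-valued records deliberately, so they never reach a code path where the framework would have to skip them:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> input = builder.stream("events");       // made-up topic
    input.filter((key, value) -> key != null && value != null)      // controlled, visible skip
         .groupByKey()
         .count();

Records dropped by the filter show up nowhere in the skip metrics or WARN logs, which is the point: anything that does show up there is genuinely unexpected.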
On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:

John,

sorry for my late reply and thanks for updating the KIP.

I like your approach of "metrics are for monitoring, logs are for debugging" -- however:

1) I don't see a connection between this and the task-level metrics that you propose in order to get the metrics in `TopologyTestDriver`. I don't think people would monitor the `TopologyTestDriver`, and thus I am wondering why it is important to include the metrics there? Thread-level metrics might be easier to monitor, though (i.e., fewer distinct metrics to monitor).

2) I am a little worried that WARN level logging might be too chatty -- as you pointed out, it's about debugging, thus DEBUG level might be better. Not 100% sure about this, to be honest. What is the general assumption about the frequency of skipped records? I could imagine cases for which skipped records are quite frequent and thus WARN level logs might "flood" the logs.

One final remark:

> More generally, I would like to establish a pattern in which we could add new values for the "reason" tags without needing a KIP to do so.

From my understanding, this is not feasible. Changing metrics is always considered a public API change, and we need a KIP for any change. As we moved away from tagging, it doesn't matter for the KIP anymore -- just wanted to point it out.

-Matthias

On 3/30/18 2:47 PM, John Roesler wrote:

Allrighty! The KIP is updated.

Thanks again, all, for the feedback.
-John

On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io> wrote:

Hey Guozhang and Bill,

Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like to put it in the task instead of the thread so that it'll show up in the TopologyTestDriver metrics as well.

I'm leaning toward keeping the scope where it is right now, but if others want to advocate for tossing in some more metrics, we can go that route.

Thanks all,
-John

On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com> wrote:

Thanks for the KIP John, and sorry for the late comments.

I'm on the fence about providing a single level of metrics, but I think we'll have that discussion outside of this KIP.

> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record

I'm +1 on both of these suggestions.

Finally, we have had requests in the past for some metrics around when a persistent store removes an expired window. Would adding that to our metrics stretch the scope of this KIP too much?

Thanks again, and overall I'm +1 on this KIP.

Bill

On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:

The proposal sounds good to me. About "maintain only one level of metrics", maybe we can discuss that separately from this KIP since that would be a larger scope of discussion. I agree that if we are going to maintain only one level of metrics it should be the lowest level and we would let users do the roll-ups themselves, but I'm still not fully convinced that we should just provide single-level metrics, because 1) for different metrics, people may be interested to investigate into different granularities, e.g. for poll / commit rate the lowest level is task-level metrics, while for process-rate / skip-rate they can be as low as processor-node metrics, and 2) user-side roll-ups may not be very straightforward. But for 2), if someone can provide an efficient and easy implementation of that, I can be persuaded :)

For now I'm thinking we can add the metric at the thread level, either with finer-grained ones carrying the "reason" tag plus an aggregated one without the tag, or just a single aggregated metric without the tag -- either looks good to me.

Guozhang
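On 2) above, a rough sketch of what a user-side roll-up could look like, assuming a handle on the metrics map (e.g. from KafkaStreams#metrics()) and a placeholder metric name -- the final names are for the KIP to define, not this sketch:

    import java.util.Map;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;

    // Sum one skip metric across all of its tags (threads/tasks) for a client-wide total.
    // "kafkaStreams" is an assumed KafkaStreams handle; the metric name is assumed too.
    Map<MetricName, ? extends Metric> metrics = kafkaStreams.metrics();
    double totalSkipped = metrics.entrySet().stream()
        .filter(e -> e.getKey().name().equals("skipped-records-total"))
        .mapToDouble(e -> ((Number) e.getValue().metricValue()).doubleValue())
        .sum();

Summing a total like this is simple enough; the awkward part is that the caller has to poll metrics() and do the aggregation itself rather than getting the roll-up from a metrics reporter.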
On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:

Hey Guozhang,

Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list in the KIP. I was also thinking we'd better start a documentation page with the metrics listed.

I'd have no problem logging a warning when we skip records. On the metric front, really I'm just pushing for us to maintain only one level of metrics. If that's more or less granular (i.e., maybe we don't have a metric per reason and log the reason instead), that's fine by me. I just don't think maintaining roll-ups at the thread level in addition to lower-level metrics provides a lot of extra value relative to the complexity (interface and implementation).

How about this instead:
* maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
* introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record

If you like that, I can update the KIP.

Thanks,
-John
On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?

For the commonly used JMX reporter, tags are encoded directly as part of the object name, and if users want to monitor them they need to know these values beforehand. That is also why I think we do want to list all the possible values of the reason tags in the KIP.

> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.

That is a good point. I think we can either expose all levels of metrics by default, or only expose the lowest-level metrics and get rid of the other levels to let users do the roll-ups themselves (which would be a much larger scope of discussion), or we can encourage users to not purely depend on metrics for such troubleshooting: that is to say, users would only be alerted based on metrics, and we would log an INFO / WARN log4j entry each time we are about to skip a record, in all the relevant places, so that upon being notified users can look into the logs to find the details on where / when it happened. WDYT?

Guozhang
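To illustrate the JMX point: with the default JmxReporter, each tag becomes a key property of the MBean's ObjectName, so an external dashboard or alert rule addresses the metric by a name roughly like the pattern below. The metric group and tag names here are assumptions for illustration, not the final names from the KIP:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Inside the JVM you can wildcard the tag values; an external monitoring system
    // generally has to know the concrete client-id / task-id values up front.
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    ObjectName pattern =
        new ObjectName("kafka.streams:type=stream-task-metrics,client-id=*,task-id=*");
    server.queryNames(pattern, null)
          .forEach(name -> System.out.println(name.getKeyProperty("task-id")));

(new ObjectName(...) throws the checked MalformedObjectNameException, so in real code this sits in a method that declares or handles it.)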
On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io> wrote:

Hey Guozhang,

Thanks for the review.

1. Matthias raised the same question about the "reason" tag values. I can list all possible values of the "reason" tag, but I'm thinking this level of detail may not be KIP-worthy; maybe the code and documentation review would be sufficient. If you all disagree and would like it included in the KIP, I can certainly do that.

If we do provide roll-up metrics, I agree with the pattern of keeping the same name but eliminating the tags for the dimensions that were rolled up.

2. I'm not too sure that implementation efficiency really becomes a factor in choosing whether to (by default) update one coarse metric at the thread level or one granular metric at the processor-node level, since it's just one metric being updated either way. I do agree that if we were to update the granular metrics and multiple roll-ups, then we should consider the efficiency.

I agree it's probably not necessary to surface the metrics for all nodes regardless of whether they can or do skip records. Perhaps we can lazily register the metrics.

In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.

One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?

Thanks again,
-John
On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote:

Hello John,

Thanks for the KIP. Some comments:

1. Could you list all the possible values of the "reason" tag? In the JIRA ticket I left some potential reasons, but I'm not clear whether you're going to categorize each of them as a separate reason, or whether there are additional ones you have in mind.

Also, I'm wondering if we should add another metric that does not have the reason tag but aggregates across all possible reasons? This is for users to easily set their alerting notifications (otherwise they have to write one notification rule per reason) in their monitoring systems.

2. Note that the processor-node metrics are actually "per-thread, per-task, per-processor-node", and today we only set the per-thread metrics as INFO while leaving the lower two layers as DEBUG. I agree with your argument that we are missing the per-client roll-up metrics today, but I'm not convinced that the right way to approach it would be "just providing the lowest-level metrics only".

Note that the recording implementation of these three levels is different internally today: we did not just do the rolling up to generate the higher-level metrics from the lower-level ones; we record them separately, which means that if we turn on multiple levels of metrics, we may duplicate the collection of some metrics. One can argue that is not the best way to represent multi-level metrics collecting and reporting, but by only enabling thread-level metrics as INFO today, that implementation could be more efficient than collecting the metrics only at the lowest level and then doing the roll-up calculations outside of the metrics classes.

Plus, today not all processor nodes may possibly skip records; AFAIK we will only skip records at the source, sink, window and aggregation processor nodes, so adding a metric per processor looks like overkill to me as well. On the other hand, from the user's perspective the "reason" tag may be sufficient for them to narrow down where inside the topology records are being dropped on the floor. So I think the "per-thread, per-task" level metrics should be sufficient for troubleshooting in DEBUG mode, and we can add another "per-thread" level metric as INFO which is turned on by default. So under normal execution users still only need INFO-level metrics for alerting (e.g. set alerts on all skipped-records metrics being non-zero), and then upon troubleshooting they can turn on DEBUG metrics to look into which task is actually causing the skipped records.

Guozhang
On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:

Thanks for the KIP John.

Reading the material on the related Jiras, I am wondering what `reason` tags you want to introduce? Can you elaborate? The KIP should list those, IMHO.

About the fine-grained metrics vs. the roll-up: you say that

> the coarse metric aggregates across two dimensions simultaneously

Can you elaborate why this is an issue? I am not convinced atm that we should put the fine-grained metrics into INFO level and remove the roll-up at thread level.

> Given that they have to do this sum to get a usable top-level view

This is a fair concern, but I don't share the conclusion. Offering a built-in `KafkaStreams` "client" roll-up out of the box might be a better solution. In the past we did not offer this due to performance concerns, but we could allow an "opt-in" mechanism. If you disagree, can you provide some reasoning and add it to the "Rejected alternatives" section?

To rephrase: I understand the issue about the missing top-level view, but instead of going more fine-grained, we should consider adding this top-level view and adding/keeping the fine-grained metrics at DEBUG level only.

I am +1 to add TopologyTestDriver#metrics() and to remove old metrics directly as you suggested.

-Matthias
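For the TopologyTestDriver part, the kind of test this would enable might look roughly like the following. The metrics() accessor is the proposed addition; the metric name, config values, and buildTestTopology() helper are placeholders:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.TopologyTestDriver;

    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "skip-metrics-test");   // placeholder
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");       // never contacted

    Topology topology = buildTestTopology();  // assumed helper building the topology under test
    try (TopologyTestDriver driver = new TopologyTestDriver(topology, config)) {
        // ... pipe in some records that are expected to be skipped ...
        driver.metrics().forEach((name, metric) -> {
            if (name.name().startsWith("skipped-records")) {                // assumed metric name
                System.out.println(name.tags() + " -> " + metric.metricValue());
            }
        });
    }

In a real test the printout would of course be replaced by an assertion on the metric value after piping in a known-bad record.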
On 3/28/18 6:42 PM, Ted Yu wrote:

Looks good to me.

On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:

Hello all,

I am proposing KIP-274 to improve the metrics around skipped records in Streams.

Please find the details here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics

Please let me know what you think!

Thanks,
-John