Cool, if you're satisfied with the KIP now, maybe I can lobby for your vote ;)
The vote thread is still at only one binding +1, I think. Thanks, -John On Tue, Apr 3, 2018 at 9:05 PM, Matthias J. Sax <matth...@confluent.io> wrote: > Sounds great! > > The cryptic topic names can be an issue -- however, people can > `describe()` their topology to map the name to the corresponding > sub-topology/tasks to narrow the error down to the corresponding > operators. I think, this should be "sufficient for now" for debugging. > > Renaming those topic seems to be out-of-scope for this KIP. > > > -Matthias > > On 4/3/18 2:45 PM, Guozhang Wang wrote: > > Thanks John, your proposal looks fine to me. > > > > I'll go ahead and look into the PR for more details myself. > > > > > > Guozhang > > > > On Tue, Apr 3, 2018 at 1:35 PM, Bill Bejeck <bbej...@gmail.com> wrote: > > > >> Hi John, > >> > >> Thanks for making the updates. > >> > >> I agree with the information you've included in the logs as described > >> above, as log statements without enough context/information can be > >> frustrating. > >> > >> -Bill > >> > >> On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <j...@confluent.io> wrote: > >> > >>> Allrighty, how about this, then... > >>> > >>> I'll move the metric back to the StreamThread and maintain the existing > >> tag > >>> (client-id=...(per-thread client-id)). It won't be present in the > >>> TopologyTestDriver's metrics. > >>> > >>> As a side note, I'm not sure that the location of the log messages has > >>> visibility into the name of the thread or the task, or the processor > >> node, > >>> for that matter. But at the end of the day, I don't think it really > >>> matters. > >>> > >>> None of those identifiers are in the public interface or > user-controlled. > >>> For them to be useful for debugging, users would have to gain a very > deep > >>> understanding of how their DSL program gets executed. From my > >> perspective, > >>> they are all included in metric tags only to prevent collisions between > >> the > >>> same metrics in different (e.g.) threads. > >>> > >>> I think what's important is to provide the right information in the > logs > >>> that users will be able to debug their issues. This is why the logs in > my > >>> pr include the topic/partition/offset of the offending data, as well as > >> the > >>> stacktrace of the exception from the deserializer (or for timestamps, > the > >>> extracted timestamp and the class name of their extractor). This > >>> information alone should let them pinpoint the offending data and fix > it. > >>> > >>> (I am aware that that topic name might be a repartition topic, and > >>> therefore also esoteric from the user's perspective, but I think it's > the > >>> best we can do right now. It might be nice to explicitly take on a > >>> debugging ergonomics task in the future and give all processor nodes > >>> human-friendly names. Then, we could surface these names in any logs or > >>> exceptions. But I'm inclined to call this out-of-scope for now.) > >>> > >>> Thanks again, > >>> -John > >>> > >>> On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wangg...@gmail.com> > >> wrote: > >>> > >>>> 1. If we can indeed gather all the context information from the log4j > >>>> entries I'd suggest we change to thread-level (I'm not sure if that is > >>>> doable, so if John have already some WIP PR that can help us decide). > >>>> > >>>> 2. 
We can consider adding the API in TopologyTestDriver for general > >>> testing > >>>> purposes; that being said, I think Matthias has a good point that this > >>>> alone should not be a driving motivation for us to keep this metric as > >>>> task-level if 1) is true. > >>>> > >>>> > >>>> > >>>> Guozhang > >>>> > >>>> > >>>> On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax < > >> matth...@confluent.io> > >>>> wrote: > >>>> > >>>>> Thanks Guozhang, that was my intent. > >>>>> > >>>>> @John: yes, we should not nail down the exact log message. It's just > >> to > >>>>> point out the trade-off. If we can get the required information in > >> the > >>>>> logs, we might not need task level metrics. > >>>>> > >>>>> > >>>>> -Matthias > >>>>> > >>>>> On 4/3/18 11:26 AM, Guozhang Wang wrote: > >>>>>> I think Matthias' comment is that, we can still record the metrics > >> on > >>>> the > >>>>>> thread-level, while having the WARN log entry to include sufficient > >>>>> context > >>>>>> information so that users can still easily narrow down the > >>>> investigation > >>>>>> scope. > >>>>>> > >>>>>> > >>>>>> Guozhang > >>>>>> > >>>>>> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io> > >>>> wrote: > >>>>>> > >>>>>>> I agree we should add as much information as is reasonable to the > >>> log. > >>>>> For > >>>>>>> example, see this WIP PR I started for this KIP: > >>>>>>> > >>>>>>> https://github.com/apache/kafka/pull/4812/files#diff- > >>>>>>> 88d129f048bc842c7db5b2566a45fce8R80 > >>>>>>> > >>>>>>> and > >>>>>>> > >>>>>>> https://github.com/apache/kafka/pull/4812/files#diff- > >>>>>>> 69e6789eb675ec978a1abd24fed96eb1R111 > >>>>>>> > >>>>>>> I'm not sure if we should nail down the log messages in the KIP or > >>> in > >>>>> the > >>>>>>> PR discussion. What say you? > >>>>>>> > >>>>>>> Thanks, > >>>>>>> -John > >>>>>>> > >>>>>>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax < > >>>> matth...@confluent.io > >>>>>> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Thanks for sharing your thoughts. As I mentioned originally, I am > >>> not > >>>>>>>> sure about the right log level either. Your arguments are > >>> convincing > >>>> -- > >>>>>>>> thus, I am fine with keeping WARN level. > >>>>>>>> > >>>>>>>> The task vs thread level argument is an interesting one. > >> However, I > >>>> am > >>>>>>>> wondering if we should add this information into the > >> corresponding > >>>> WARN > >>>>>>>> logs that we write anyway? For this case, we can also log the > >>>>>>>> corresponding operator (and other information like topic name etc > >>> if > >>>>>>>> needed). WDYT about this? > >>>>>>>> > >>>>>>>> > >>>>>>>> -Matthias > >>>>>>>> > >>>>>>>> On 4/2/18 8:31 PM, Guozhang Wang wrote: > >>>>>>>>> Regarding logging: I'm inclined to keep logging at WARN level > >>> since > >>>>>>>> skipped > >>>>>>>>> records are not expected in normal execution (for all reasons > >> that > >>>> we > >>>>>>> are > >>>>>>>>> aware of), and hence when error happens users should be alerted > >>> from > >>>>>>>>> metrics and looked into the log files, so to me if it is really > >>>>>>> spamming > >>>>>>>>> the log files it is also a good alert for users. Besides for > >>>>>>> deserialize > >>>>>>>>> errors we already log at WARN level for this reason. > >>>>>>>>> > >>>>>>>>> Regarding the metrics-levels: I was pondering on that as well. 
> >>> What > >>>>>>> made > >>>>>>>> me > >>>>>>>>> to think and agree on task-level than thread-level is that for > >>> some > >>>>>>>> reasons > >>>>>>>>> like window retention, they may possibly be happening on a > >> subset > >>> of > >>>>>>>> input > >>>>>>>>> partitions, and tasks are correlated with partitions the > >>> task-level > >>>>>>>> metrics > >>>>>>>>> can help users to narrow down on the specific input data > >>> partitions. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Guozhang > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io > >>> > >>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Matthias, > >>>>>>>>>> > >>>>>>>>>> No worries! Thanks for the reply. > >>>>>>>>>> > >>>>>>>>>> 1) There isn't a connection. I tried using the > >> TopologyTestDriver > >>>> to > >>>>>>>> write > >>>>>>>>>> a quick test exercising the current behavior and discovered > >> that > >>>> the > >>>>>>>>>> metrics weren't available. It seemed like they should be, so I > >>>> tacked > >>>>>>>> it on > >>>>>>>>>> to this KIP. If you feel it's inappropriate, I can pull it back > >>>> out. > >>>>>>>>>> > >>>>>>>>>> 2) I was also concerned about that, but I figured it would come > >>> up > >>>> in > >>>>>>>>>> discussion if I just went ahead and proposed it. And here we > >> are! > >>>>>>>>>> > >>>>>>>>>> Here's my thought: maybe there are two classes of skips: > >>>> "controlled" > >>>>>>>> and > >>>>>>>>>> "uncontrolled", where "controlled" means, as an app author, I > >>>>>>>> deliberately > >>>>>>>>>> filter out some events, and "uncontrolled" means that I simply > >>>> don't > >>>>>>>>>> account for some feature of the data, and the framework skips > >>> them > >>>>> (as > >>>>>>>>>> opposed to crashing). > >>>>>>>>>> > >>>>>>>>>> In this breakdowns, the skips I'm adding metrics for are all > >>>>>>>> uncontrolled > >>>>>>>>>> skips (and we hope to measure all the uncontrolled skips). Our > >>>> skips > >>>>>>> are > >>>>>>>>>> well documented, so it wouldn't be terrible to have an > >>> application > >>>> in > >>>>>>>> which > >>>>>>>>>> you know you expect to have tons of uncontrolled skips, but > >> it's > >>>> not > >>>>>>>> great > >>>>>>>>>> either, since you may also have some *unexpected* uncontrolled > >>>> skips. > >>>>>>>> It'll > >>>>>>>>>> be difficult to notice, since you're probably not alerting on > >> the > >>>>>>> metric > >>>>>>>>>> and filtering out the logs (whatever their level). > >>>>>>>>>> > >>>>>>>>>> I'd recommend any app author, as an alternative, to convert all > >>>>>>> expected > >>>>>>>>>> skips to controlled ones, by updating the topology to filter > >>> those > >>>>>>>> records > >>>>>>>>>> out. > >>>>>>>>>> > >>>>>>>>>> Following from my recommendation, as a library author, I'm > >>> inclined > >>>>> to > >>>>>>>> mark > >>>>>>>>>> those logs WARN, since in my opinion, they should be concerning > >>> to > >>>>> the > >>>>>>>> app > >>>>>>>>>> authors. I'd definitely want to show, rather than hide, them by > >>>>>>>> default, so > >>>>>>>>>> I would pick INFO at least. > >>>>>>>>>> > >>>>>>>>>> That said, logging is always a tricky issue for lower-level > >>>> libraries > >>>>>>>> that > >>>>>>>>>> run inside user code, since we don't have all the information > >> we > >>>> need > >>>>>>> to > >>>>>>>>>> make the right call. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On your last note, yeah, I got that impression from Guozhang as > >>>> well. > >>>>>>>>>> Thanks for the clarification. 
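To make the controlled-vs-uncontrolled distinction above concrete, here is a minimal sketch of converting an expected skip into a controlled one by filtering up front (topic names, types, and the null check are placeholders, not anything from the KIP):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;

    public class ControlledSkipExample {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> input =
                builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));
            // Drop the records we know we cannot process before they reach an
            // operator that would otherwise skip them silently; the skip is
            // now "controlled" and visible in the application code.
            input.filter((key, value) -> key != null && value != null)
                 .to("cleaned-events");
        }
    }

With a filter like that in place, a non-zero skipped-records metric really does indicate something unexpected.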
> >>>>>>>>>> > >>>>>>>>>> -John > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax < > >>>>>>> matth...@confluent.io> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> John, > >>>>>>>>>>> > >>>>>>>>>>> sorry for my late reply and thanks for updating the KIP. > >>>>>>>>>>> > >>>>>>>>>>> I like your approach about "metrics are for monitoring, logs > >> are > >>>> for > >>>>>>>>>>> debugging" -- however: > >>>>>>>>>>> > >>>>>>>>>>> 1) I don't see a connection between this and the task-level > >>>> metrics > >>>>>>>> that > >>>>>>>>>>> you propose to get the metrics in `TopologyTestDriver`. I > >> don't > >>>>> think > >>>>>>>>>>> people would monitor the `TopologyTestDriver` an thus > >> wondering > >>>> why > >>>>>>> it > >>>>>>>>>>> is important to include the metrics there? Thread-level metric > >>>> might > >>>>>>> be > >>>>>>>>>>> easier to monitor though (ie, less different metric to > >> monitor). > >>>>>>>>>>> > >>>>>>>>>>> 2) I am a little worried about WARN level logging and that it > >>>> might > >>>>>>> be > >>>>>>>>>>> too chatty -- as you pointed out, it's about debugging, thus > >>> DEBUG > >>>>>>>> level > >>>>>>>>>>> might be better. Not 100% sure about this to be honest. What > >> is > >>>> the > >>>>>>>>>>> general assumption about the frequency for skipped records? I > >>>> could > >>>>>>>>>>> imagine cases for which skipped records are quite frequent and > >>>> thus, > >>>>>>>>>>> WARN level logs might "flood" the logs > >>>>>>>>>>> > >>>>>>>>>>> One final remark: > >>>>>>>>>>> > >>>>>>>>>>>> More > >>>>>>>>>>>> generally, I would like to establish a pattern in which we > >>> could > >>>>> add > >>>>>>>>>> new > >>>>>>>>>>>> values for the "reason" tags without needing a KIP to do so. > >>>>>>>>>>> > >>>>>>>>>>> From my understanding, this is not feasible. Changing metrics > >> is > >>>>>>> always > >>>>>>>>>>> considered a public API change, and we need a KIP for any > >>> change. > >>>> As > >>>>>>> we > >>>>>>>>>>> moved away from tagging, it doesn't matter for the KIP anymore > >>> -- > >>>>>>> just > >>>>>>>>>>> wanted to point it out. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> -Matthias > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On 3/30/18 2:47 PM, John Roesler wrote: > >>>>>>>>>>>> Allrighty! The KIP is updated. > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks again, all, for the feedback. > >>>>>>>>>>>> -John > >>>>>>>>>>>> > >>>>>>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler < > >>> j...@confluent.io > >>>>> > >>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hey Guozhang and Bill, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing > >> consensus, > >>>> I'd > >>>>>>>> like > >>>>>>>>>>> to > >>>>>>>>>>>>> put it in the task instead of the thread so that it'll show > >> up > >>>> in > >>>>>>> the > >>>>>>>>>>>>> TopologyTestDriver metrics as well. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I'm leaning toward keeping the scope where it is right now, > >>> but > >>>> if > >>>>>>>>>>> others > >>>>>>>>>>>>> want to advocate for tossing in some more metrics, we can go > >>>> that > >>>>>>>>>> route. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks all, > >>>>>>>>>>>>> -John > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck < > >>> bbej...@gmail.com > >>>>> > >>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks for the KIP John, and sorry for the late comments. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I'm on the fence with providing a single level metrics, > >> but I > >>>>>>> think > >>>>>>>>>>> we'll > >>>>>>>>>>>>>> have that discussion outside of this KIP. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread, > >>>>>>>> per-task, > >>>>>>>>>>> or > >>>>>>>>>>>>>>> per-processor-node) with no "reason" > >>>>>>>>>>>>>>> * introduce a warn-level log detailing the > >>>>> topic/partition/offset > >>>>>>>>>> and > >>>>>>>>>>>>>>> reason of the skipped record > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I'm +1 on both of these suggestions. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Finally, we have had requests in the past for some metrics > >>>> around > >>>>>>>>>> when > >>>>>>>>>>>>>> persistent store removes an expired window. Would adding > >>> that > >>>> to > >>>>>>>> our > >>>>>>>>>>>>>> metrics stretch the scope of this KIP too much? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks again and overall I'm +1 on this KIP > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Bill > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang < > >>>>>>> wangg...@gmail.com> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The proposal sounds good to me. About "maintain only one > >>> level > >>>>> of > >>>>>>>>>>>>>> metrics" > >>>>>>>>>>>>>>> maybe we can discuss about that separately from this KIP > >>> since > >>>>>>> that > >>>>>>>>>>>>>> would > >>>>>>>>>>>>>>> be a larger scope of discussion. I agree that if we are > >>> going > >>>> to > >>>>>>>>>>>>>> maintain > >>>>>>>>>>>>>>> only one-level metrics it should be lowest level and we > >>> would > >>>>> let > >>>>>>>>>>> users > >>>>>>>>>>>>>> to > >>>>>>>>>>>>>>> do the roll-ups themselves, but I'm still not fully > >>> convinced > >>>>>>> that > >>>>>>>>>> we > >>>>>>>>>>>>>>> should just provide single-level metrics, because 1) I > >> think > >>>> for > >>>>>>>>>>>>>> different > >>>>>>>>>>>>>>> metrics people may be interested to investigate into > >>> different > >>>>>>>>>>>>>>> granularities, e.g. for poll / commit rate these are at > >> the > >>>>>>> lowest > >>>>>>>>>>>>>>> task-level metrics, while for process-rate / skip-rate > >> they > >>>> can > >>>>>>> be > >>>>>>>>>> as > >>>>>>>>>>>>>> low > >>>>>>>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups > >> may > >>>> not > >>>>>>> be > >>>>>>>>>>> very > >>>>>>>>>>>>>>> straight-forward. But for 2) if someone can provide an > >>>> efficient > >>>>>>>> and > >>>>>>>>>>>>>> easy > >>>>>>>>>>>>>>> implementation of that I can be persuaded :) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> For now I'm thinking we can add the metric on > >> thread-level, > >>>>>>> either > >>>>>>>>>>> with > >>>>>>>>>>>>>>> finer grained ones with "reason" tag plus an aggregated > >> one > >>>>>>> without > >>>>>>>>>>> the > >>>>>>>>>>>>>>> tag, or just having a single aggregated metric without the > >>> tag > >>>>>>>> looks > >>>>>>>>>>>>>> good > >>>>>>>>>>>>>>> to me. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Guozhang > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler < > >>>>> j...@confluent.io > >>>>>>>> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Hey Guozhang, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll > >>>> provide > >>>>>>> a > >>>>>>>>>>>>>> list in > >>>>>>>>>>>>>>>> the KIP. I was also thinking we'd better start a > >>>> documentation > >>>>>>>> page > >>>>>>>>>>>>>> with > >>>>>>>>>>>>>>>> the metrics listed. 
> >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I'd have no problem logging a warning when we skip > >> records. > >>>> On > >>>>>>> the > >>>>>>>>>>>>>> metric > >>>>>>>>>>>>>>>> front, really I'm just pushing for us to maintain only > >> one > >>>>> level > >>>>>>>> of > >>>>>>>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we > >>>> don't > >>>>>>>>>> have a > >>>>>>>>>>>>>>>> metric per reason and log the reason instead), that's > >> fine > >>> by > >>>>>>> me. > >>>>>>>> I > >>>>>>>>>>>>>> just > >>>>>>>>>>>>>>>> don't think it provides a lot of extra value per > >> complexity > >>>>>>>>>>> (interface > >>>>>>>>>>>>>>> and > >>>>>>>>>>>>>>>> implementation) to maintain roll-ups at the thread level > >> in > >>>>>>>>>> addition > >>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>> lower-level metrics. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> How about this instead: > >>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be > >> per-thread, > >>>>>>>>>> per-task, > >>>>>>>>>>>>>> or > >>>>>>>>>>>>>>>> per-processor-node) with no "reason" > >>>>>>>>>>>>>>>> * introduce a warn-level log detailing the > >>>>>>> topic/partition/offset > >>>>>>>>>> and > >>>>>>>>>>>>>>>> reason of the skipped record > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> If you like that, I can update the KIP. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>> -John > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang < > >>>>>>>> wangg...@gmail.com > >>>>>>>>>>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts > >> on > >>>>>>> coarser > >>>>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting > >>>> systems > >>>>> I > >>>>>>>>>> have > >>>>>>>>>>>>>>> used > >>>>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over > >>>> tags. > >>>>>>> So > >>>>>>>>>> my > >>>>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you > >>>>>>> thinking > >>>>>>>>>>>>>> of an > >>>>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation > >>> valuable? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded > >>>>>>> directly > >>>>>>>>>> as > >>>>>>>>>>>>>>> part > >>>>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>>> the object name, and if users wants to monitor them they > >>>> need > >>>>>>> to > >>>>>>>>>>>>>> know > >>>>>>>>>>>>>>>> these > >>>>>>>>>>>>>>>>> values before hand. That is also why I think we do want > >> to > >>>>> list > >>>>>>>>>> all > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> possible values of the reason tags in the KIP, since > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example > >> of > >>>> the > >>>>>>>>>>>>>> kind of > >>>>>>>>>>>>>>>>> scenario that would lead me as an operator to run with > >>> DEBUG > >>>>> on > >>>>>>>>>> all > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped > >>> record > >>>>>>>> once, > >>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>> it > >>>>>>>>>>>>>>>>> would ever happen again. The solution is to capture all > >>> the > >>>>>>>>>>>>>> available > >>>>>>>>>>>>>>>>> information about the reason and location of skips all > >> the > >>>>>>> time. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> That is a good point. 
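For reference, a sketch of what "tags encoded as part of the object name" means on the JMX side; the group and tag names below are illustrative, not the final metric names:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class JmxTagExample {
        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Each metric tag becomes a key=value property of the ObjectName,
            // so an alerting rule has to know the tag values (client-id,
            // task-id, a hypothetical "reason", ...) ahead of time, or use
            // wildcards to discover them.
            ObjectName pattern = new ObjectName(
                "kafka.streams:type=stream-task-metrics,client-id=*,task-id=*");
            server.queryNames(pattern, null).forEach(System.out::println);
        }
    }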
I think we can either expose all > >>>> levels > >>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>> as > >>>>>>>>>>>>>>>> by > >>>>>>>>>>>>>>>>> default, or only expose the most lower-level metrics and > >>> get > >>>>>>> rid > >>>>>>>>>> of > >>>>>>>>>>>>>>> other > >>>>>>>>>>>>>>>>> levels to let users do roll-ups themselves (which will > >> be > >>> a > >>>>>>> much > >>>>>>>>>>>>>> larger > >>>>>>>>>>>>>>>>> scope for discussion), or we can encourage users to not > >>>> purely > >>>>>>>>>>>>>> depend > >>>>>>>>>>>>>>> on > >>>>>>>>>>>>>>>>> metrics for such trouble shooting: that is to say, users > >>>> only > >>>>>>> be > >>>>>>>>>>>>>>> alerted > >>>>>>>>>>>>>>>>> based on metrics, and we can log a info / warn log4j > >> entry > >>>>> each > >>>>>>>>>>>>>> time we > >>>>>>>>>>>>>>>> are > >>>>>>>>>>>>>>>>> about to skip a record all over the places, so that upon > >>>> being > >>>>>>>>>>>>>> notified > >>>>>>>>>>>>>>>>> users can look into the logs to find the details on > >> where > >>> / > >>>>>>> when > >>>>>>>>>> it > >>>>>>>>>>>>>>>>> happens. WDYT? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Guozhang > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler < > >>>>>>> j...@confluent.io > >>>>>>>>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hey Guozhang, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks for the review. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> 1. > >>>>>>>>>>>>>>>>>> Matthias raised the same question about the "reason" > >> tag > >>>>>>> values. > >>>>>>>>>> I > >>>>>>>>>>>>>>> can > >>>>>>>>>>>>>>>>> list > >>>>>>>>>>>>>>>>>> all possible values of the "reason" tag, but I'm > >> thinking > >>>>> this > >>>>>>>>>>>>>> level > >>>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and > >>>>> documentation > >>>>>>>>>>>>>> review > >>>>>>>>>>>>>>>>> would > >>>>>>>>>>>>>>>>>> be sufficient. If you all disagree and would like it > >>>> included > >>>>>>> in > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> KIP, I > >>>>>>>>>>>>>>>>>> can certainly do that. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the > >>> pattern > >>>> of > >>>>>>>>>>>>>> keeping > >>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> same name but eliminating the tags for the dimensions > >>> that > >>>>>>> were > >>>>>>>>>>>>>>>>> rolled-up. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> 2. > >>>>>>>>>>>>>>>>>> I'm not too sure that implementation efficiency really > >>>>>>> becomes a > >>>>>>>>>>>>>>> factor > >>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>>> choosing whether to (by default) update one coarse > >> metric > >>>> at > >>>>>>> the > >>>>>>>>>>>>>>> thread > >>>>>>>>>>>>>>>>>> level or one granular metric at the processor-node > >> level, > >>>>>>> since > >>>>>>>>>>>>>> it's > >>>>>>>>>>>>>>>> just > >>>>>>>>>>>>>>>>>> one metric being updated either way. I do agree that if > >>> we > >>>>>>> were > >>>>>>>>>> to > >>>>>>>>>>>>>>>> update > >>>>>>>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we > >>> should > >>>>>>>>>>>>>> consider > >>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> efficiency. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> I agree it's probably not necessary to surface the > >>> metrics > >>>>> for > >>>>>>>>>> all > >>>>>>>>>>>>>>>> nodes > >>>>>>>>>>>>>>>>>> regardless of whether they can or do skip records. > >>> Perhaps > >>>> we > >>>>>>>> can > >>>>>>>>>>>>>>>> lazily > >>>>>>>>>>>>>>>>>> register the metrics. 
> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example > >> of > >>>> the > >>>>>>>>>>>>>> kind of > >>>>>>>>>>>>>>>>>> scenario that would lead me as an operator to run with > >>>> DEBUG > >>>>>>> on > >>>>>>>>>>>>>> all > >>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped > >>>> record > >>>>>>>>>> once, > >>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>>> it > >>>>>>>>>>>>>>>>>> would ever happen again. The solution is to capture all > >>> the > >>>>>>>>>>>>>> available > >>>>>>>>>>>>>>>>>> information about the reason and location of skips all > >>> the > >>>>>>> time. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts > >> on > >>>>>>> coarser > >>>>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting > >>>> systems > >>>>>>> I > >>>>>>>>>>>>>> have > >>>>>>>>>>>>>>>> used > >>>>>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over > >>>> tags. > >>>>>>> So > >>>>>>>>>>>>>> my > >>>>>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were > >> you > >>>>>>>> thinking > >>>>>>>>>>>>>> of > >>>>>>>>>>>>>>> an > >>>>>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation > >>> valuable? > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks again, > >>>>>>>>>>>>>>>>>> -John > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang < > >>>>>>>>>>>>>> wangg...@gmail.com> > >>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Hello John, > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks for the KIP. Some comments: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> 1. Could you list all the possible values of the > >>> "reason" > >>>>>>> tag? > >>>>>>>>>>>>>> In > >>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> JIRA > >>>>>>>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear > >>> if > >>>>>>>> you're > >>>>>>>>>>>>>>>> going > >>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>> categorize each of them as a separate reason, or is > >>> there > >>>>> any > >>>>>>>>>>>>>>>>> additional > >>>>>>>>>>>>>>>>>>> ones you have in mind. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Also I'm wondering if we should add another metric > >> that > >>> do > >>>>>>> not > >>>>>>>>>>>>>> have > >>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>> reason tag but aggregates among all possible reasons? > >>> This > >>>>> is > >>>>>>>>>>>>>> for > >>>>>>>>>>>>>>>> users > >>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>> easily set their alerting notifications (otherwise > >> they > >>>> have > >>>>>>> to > >>>>>>>>>>>>>>> write > >>>>>>>>>>>>>>>>> on > >>>>>>>>>>>>>>>>>>> notification rule per reason) in their monitoring > >>> systems. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually > >>>>>>>> "per-thread, > >>>>>>>>>>>>>>>>>> per-task, > >>>>>>>>>>>>>>>>>>> per-processor-node", and today we only set the > >>> per-thread > >>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>> as > >>>>>>>>>>>>>>>>> INFO > >>>>>>>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. 
I agree > >>> with > >>>>>>> your > >>>>>>>>>>>>>>>> argument > >>>>>>>>>>>>>>>>>>> that we are missing the per-client roll-up metrics > >>> today, > >>>>> but > >>>>>>>>>>>>>> I'm > >>>>>>>>>>>>>>>>>> convinced > >>>>>>>>>>>>>>>>>>> that the right way to approach it would be > >>>>>>>>>>>>>>>> "just-providing-the-lowest- > >>>>>>>>>>>>>>>>>>> level > >>>>>>>>>>>>>>>>>>> metrics only". > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Note the recoding implementation of these three levels > >>> are > >>>>>>>>>>>>>>> different > >>>>>>>>>>>>>>>>>>> internally today: we did not just do the rolling up to > >>>>>>> generate > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we > >>>> just > >>>>>>>>>>>>>> record > >>>>>>>>>>>>>>>> them > >>>>>>>>>>>>>>>>>>> separately, which means that, if we turn on multiple > >>>> levels > >>>>>>> of > >>>>>>>>>>>>>>>> metrics, > >>>>>>>>>>>>>>>>>> we > >>>>>>>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue > >>>> that > >>>>>>> is > >>>>>>>>>>>>>> not > >>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> best > >>>>>>>>>>>>>>>>>>> way to represent multi-level metrics collecting and > >>>>>>> reporting, > >>>>>>>>>>>>>> but > >>>>>>>>>>>>>>> by > >>>>>>>>>>>>>>>>>> only > >>>>>>>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that > >>>>>>>> implementation > >>>>>>>>>>>>>>>> could > >>>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>>>> more efficient than only collecting the metrics at the > >>>>> lowest > >>>>>>>>>>>>>>> level, > >>>>>>>>>>>>>>>>> and > >>>>>>>>>>>>>>>>>>> then do the roll-up calculations outside of the > >> metrics > >>>>>>>> classes. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip > >>>>>>> records, > >>>>>>>>>>>>>>> AFAIK > >>>>>>>>>>>>>>>> we > >>>>>>>>>>>>>>>>>>> will only skip records at the source, sink, window and > >>>>>>>>>>>>>> aggregation > >>>>>>>>>>>>>>>>>>> processor nodes, so adding a metric per processor > >> looks > >>>> like > >>>>>>> an > >>>>>>>>>>>>>>>>> overkill > >>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>> me as well. On the other hand, from user's perspective > >>> the > >>>>>>>>>>>>>> "reason" > >>>>>>>>>>>>>>>> tag > >>>>>>>>>>>>>>>>>> may > >>>>>>>>>>>>>>>>>>> be sufficient for them to narrow down where inside the > >>>>>>> topology > >>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>>>> causing > >>>>>>>>>>>>>>>>>>> records to be dropped on the floor. So I think the > >>>>>>> "per-thread, > >>>>>>>>>>>>>>>>> per-task" > >>>>>>>>>>>>>>>>>>> level metrics should be sufficient for them in trouble > >>>> shoot > >>>>>>> in > >>>>>>>>>>>>>>> DEBUG > >>>>>>>>>>>>>>>>>> mode, > >>>>>>>>>>>>>>>>>>> and we can add another "per-thread" level metrics as > >>> INFO > >>>>>>> which > >>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>>> turned > >>>>>>>>>>>>>>>>>>> on by default. So under normal execution users still > >>> only > >>>>>>> need > >>>>>>>>>>>>>> INFO > >>>>>>>>>>>>>>>>> level > >>>>>>>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all > >>>> skipped-records > >>>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>>> as > >>>>>>>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can > >> turn > >>> on > >>>>>>>> DEBUG > >>>>>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>> look into which task is actually causing the skipped > >>>>> records. 
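A sketch of the "turn on DEBUG metrics" troubleshooting step described above, assuming the standard metrics.recording.level client config; the application id and broker address are placeholders:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class RecordingLevelExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            // The default recording level is INFO; switching to DEBUG also
            // records the finer-grained task- and processor-node-level
            // metrics for troubleshooting.
            props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");
        }
    }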
> >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Guozhang > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax < > >>>>>>>>>>>>>>>>> matth...@confluent.io> > >>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Thanks for the KIP John. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am > >>> wondering > >>>>>>> what > >>>>>>>>>>>>>>>>> `reason` > >>>>>>>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The > >> KIP > >>>>>>> should > >>>>>>>>>>>>>>> list > >>>>>>>>>>>>>>>>>> those > >>>>>>>>>>>>>>>>>>>> IMHO. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you > >> say > >>>> that > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions > >>>>>>>>>>>>>>> simultaneously > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not > >>>> convinced > >>>>>>> atm > >>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>>> we > >>>>>>>>>>>>>>>>>>>> should put the fine grained metrics into INFO level > >> and > >>>>>>> remove > >>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>>> roll-up at thread level. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable > >>>>>>>>>>>>>> top-level > >>>>>>>>>>>>>>>> view > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> This is a fair concern, but I don't share the > >>> conclusion. > >>>>>>>>>>>>>>> Offering > >>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the > >> box > >>>>>>> might > >>>>>>>>>>>>>> be > >>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>>>> better solution. In the past we did not offer this > >> due > >>> to > >>>>>>>>>>>>>>>> performance > >>>>>>>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. > >> If > >>>> you > >>>>>>>>>>>>>>>> disagree, > >>>>>>>>>>>>>>>>>> can > >>>>>>>>>>>>>>>>>>>> you provide some reasoning and add them to the > >>> "Rejected > >>>>>>>>>>>>>>>>> alternatives" > >>>>>>>>>>>>>>>>>>>> section. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> To rephrase: I understand the issue about missing > >>>> top-level > >>>>>>>>>>>>>> view, > >>>>>>>>>>>>>>>> but > >>>>>>>>>>>>>>>>>>>> instead of going more fine grained, we should > >> consider > >>> to > >>>>>>> add > >>>>>>>>>>>>>>> this > >>>>>>>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics > >> at > >>>>>>> DEBUG > >>>>>>>>>>>>>>> level > >>>>>>>>>>>>>>>>>> only > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to > >>> remove > >>>>>>> old > >>>>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>>>>>>> directly as you suggested. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> -Matthias > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote: > >>>>>>>>>>>>>>>>>>>>> Looks good to me. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler < > >>>>>>>>>>>>>>> j...@confluent.io > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Hello all, > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics > >> around > >>>>>>>>>>>>>> skipped > >>>>>>>>>>>>>>>>> records > >>>>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>>>>>>> Streams. 
> >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Please find the details here: > >>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/ > >>> confluence/display/KAFKA/KIP- > >>>>>>>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Please let me know what you think! > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>>>>> -John > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> -- > >>>>>>>>>>>>>>>>>>> -- Guozhang > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> -- > >>>>>>>>>>>>>>>>> -- Guozhang > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> -- > >>>>>>>>>>>>>>> -- Guozhang > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>>> > >>>> > >>>> > >>>> -- > >>>> -- Guozhang > >>>> > >>> > >> > > > > > > > >
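Finally, a hedged sketch of how the TopologyTestDriver#metrics() accessor discussed in this thread might be used once the KIP lands; the trivial topology and the "skipped" name filter are illustrative and may not match what finally ships:

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.TopologyTestDriver;

    public class SkippedRecordsMetricExample {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input").to("output"); // trivial placeholder topology

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "skip-metric-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

            TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props);
            // metrics() is the accessor proposed in this thread; here we just
            // print any metric whose name mentions skipped records.
            Map<MetricName, ? extends Metric> metrics = driver.metrics();
            metrics.forEach((name, metric) -> {
                if (name.name().contains("skipped")) {
                    System.out.println(name + " = " + metric.metricValue());
                }
            });
            driver.close();
        }
    }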