Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

John Roesler Tue, 03 Apr 2018 11:23:21 -0700

I agree we should add as much information as is reasonable to the log. For
example, see this WIP PR I started for this KIP:


https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80

and

https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111

I'm not sure if we should nail down the log messages in the KIP or in the
PR discussion. What say you?
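
For illustration only, here is a minimal sketch of the shape being discussed
in this thread: one skipped-record count per task with no "reason" dimension,
plus a WARN-style message carrying the topic/partition/offset and reason.
The class name, method names, and message format below are all hypothetical;
this is not the code in the PR and not a Kafka Streams API.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch; none of these names exist in Kafka Streams. */
final class SkippedRecordsSketch {
    // One counter per task id (for example "0_1"); the reason is logged, not tagged.
    private final Map<String, LongAdder> skippedByTask = new ConcurrentHashMap<>();

    /** Counts the skip and returns the message a caller would log at WARN. */
    String recordSkip(String taskId, String topic, int partition, long offset, String reason) {
        skippedByTask.computeIfAbsent(taskId, id -> new LongAdder()).increment();
        return String.format(
            Locale.ROOT,
            "Skipping record. task=[%s] topic=[%s] partition=[%d] offset=[%d] reason=[%s]",
            taskId, topic, partition, offset, reason);
    }

    /** Task-level view: helps narrow a problem down to specific input partitions. */
    long skipped(String taskId) {
        LongAdder adder = skippedByTask.get(taskId);
        return adder == null ? 0L : adder.sum();
    }

    /** A user-side roll-up across all tasks, computed on demand. */
    long totalSkipped() {
        return skippedByTask.values().stream().mapToLong(LongAdder::sum).sum();
    }
}
```

A caller would pass the returned string to whatever logger it uses at WARN
level and alert on the task-level count, or on the roll-up, being non-zero.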

Thanks,
-John

On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matth...@confluent.io>
wrote:

> Thanks for sharing your thoughts. As I mentioned originally, I am not
> sure about the right log level either. Your arguments are convincing --
> thus, I am fine with keeping WARN level.
>
> The task vs thread level argument is an interesting one. However, I am
> wondering if we should add this information into the corresponding WARN
> logs that we write anyway? For this case, we can also log the
> corresponding operator (and other information like topic name etc if
> needed). WDYT about this?
>
>
> -Matthias
>
> On 4/2/18 8:31 PM, Guozhang Wang wrote:
> > Regarding logging: I'm inclined to keep logging at WARN level, since
> > skipped records are not expected in normal execution (for all the reasons
> > that we are aware of). Hence, when an error happens, users should be
> > alerted by metrics and then look into the log files; so to me, if it is
> > really spamming the log files, that is also a good alert for users.
> > Besides, for deserialization errors we already log at WARN level for
> > this reason.
> >
> > Regarding the metrics-levels: I was pondering that as well. What made
> > me agree on task-level rather than thread-level is that skips with some
> > causes, like window retention, may happen on only a subset of input
> > partitions. Since tasks are correlated with partitions, the task-level
> > metrics can help users narrow down the specific input data partitions.
> >
> >
> > Guozhang
> >
> >
> > On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io> wrote:
> >
> >> Hi Matthias,
> >>
> >> No worries! Thanks for the reply.
> >>
> >> 1) There isn't a connection. I tried using the TopologyTestDriver to
> write
> >> a quick test exercising the current behavior and discovered that the
> >> metrics weren't available. It seemed like they should be, so I tacked
> it on
> >> to this KIP. If you feel it's inappropriate, I can pull it back out.
> >>
> >> 2) I was also concerned about that, but I figured it would come up in
> >> discussion if I just went ahead and proposed it. And here we are!
> >>
> >> Here's my thought: maybe there are two classes of skips: "controlled"
> and
> >> "uncontrolled", where "controlled" means, as an app author, I
> deliberately
> >> filter out some events, and "uncontrolled" means that I simply don't
> >> account for some feature of the data, and the framework skips them (as
> >> opposed to crashing).
> >>
> >> In this breakdown, the skips I'm adding metrics for are all
> uncontrolled
> >> skips (and we hope to measure all the uncontrolled skips). Our skips are
> >> well documented, so it wouldn't be terrible to have an application in
> which
> >> you know you expect to have tons of uncontrolled skips, but it's not
> great
> >> either, since you may also have some *unexpected* uncontrolled skips.
> It'll
> >> be difficult to notice, since you're probably not alerting on the metric
> >> and filtering out the logs (whatever their level).
> >>
> >> I'd recommend any app author, as an alternative, to convert all expected
> >> skips to controlled ones, by updating the topology to filter those
> records
> >> out.
> >>
> >> Following from my recommendation, as a library author, I'm inclined to
> mark
> >> those logs WARN, since in my opinion, they should be concerning to the
> app
> >> authors. I'd definitely want to show, rather than hide, them by
> default, so
> >> I would pick INFO at least.
> >>
> >> That said, logging is always a tricky issue for lower-level libraries
> that
> >> run inside user code, since we don't have all the information we need to
> >> make the right call.
> >>
> >>
> >>
> >> On your last note, yeah, I got that impression from Guozhang as well.
> >> Thanks for the clarification.
> >>
> >> -John
> >>
> >>
> >>
> >> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io>
> >> wrote:
> >>
> >>> John,
> >>>
> >>> sorry for my late reply and thanks for updating the KIP.
> >>>
> >>> I like your approach about "metrics are for monitoring, logs are for
> >>> debugging" -- however:
> >>>
> >>> 1) I don't see a connection between this and the task-level metrics
> >>> that you propose in order to get the metrics in `TopologyTestDriver`.
> >>> I don't think people would monitor the `TopologyTestDriver`, and thus
> >>> I wonder why it is important to include the metrics there. Thread-level
> >>> metrics might be easier to monitor though (i.e., fewer distinct
> >>> metrics to monitor).
> >>>
> >>> 2) I am a little worried about WARN level logging and that it might be
> >>> too chatty -- as you pointed out, it's about debugging, thus DEBUG
> level
> >>> might be better. Not 100% sure about this to be honest. What is the
> >>> general assumption about the frequency for skipped records? I could
> >>> imagine cases for which skipped records are quite frequent and thus,
> >>> WARN level logs might "flood" the logs.
> >>>
> >>> One final remark:
> >>>
> >>>> More
> >>>> generally, I would like to establish a pattern in which we could add
> >> new
> >>>> values for the "reason" tags without needing a KIP to do so.
> >>>
> >>> From my understanding, this is not feasible. Changing metrics is always
> >>> considered a public API change, and we need a KIP for any change. As we
> >>> moved away from tagging, it doesn't matter for the KIP anymore -- just
> >>> wanted to point it out.
> >>>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 3/30/18 2:47 PM, John Roesler wrote:
> >>>> Allrighty! The KIP is updated.
> >>>>
> >>>> Thanks again, all, for the feedback.
> >>>> -John
> >>>>
> >>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io>
> >> wrote:
> >>>>
> >>>>> Hey Guozhang and Bill,
> >>>>>
> >>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd
> like
> >>> to
> >>>>> put it in the task instead of the thread so that it'll show up in the
> >>>>> TopologyTestDriver metrics as well.
> >>>>>
> >>>>> I'm leaning toward keeping the scope where it is right now, but if
> >>> others
> >>>>> want to advocate for tossing in some more metrics, we can go that
> >> route.
> >>>>>
> >>>>> Thanks all,
> >>>>> -John
> >>>>>
> >>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com>
> >> wrote:
> >>>>>
> >>>>>> Thanks for the KIP John, and sorry for the late comments.
> >>>>>>
> >>>>>> I'm on the fence about providing single-level metrics, but I think
> >>> we'll
> >>>>>> have that discussion outside of this KIP.
> >>>>>>
> >>>>>>> * maintain one skipped-record metric (could be per-thread,
> per-task,
> >>> or
> >>>>>>> per-processor-node) with no "reason"
> >>>>>>> * introduce a warn-level log detailing the topic/partition/offset
> >> and
> >>>>>>> reason of the skipped record
> >>>>>>
> >>>>>> I'm +1 on both of these suggestions.
> >>>>>>
> >>>>>> Finally, we have had requests in the past for some metrics around
> >> when
> >>>>>> a persistent store removes an expired window. Would adding that to
> our
> >>>>>> metrics stretch the scope of this KIP too much?
> >>>>>>
> >>>>>> Thanks again and overall I'm +1 on this KIP
> >>>>>>
> >>>>>> Bill
> >>>>>>
> >>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> The proposal sounds good to me. About "maintain only one level of
> >>>>>>> metrics": maybe we can discuss that separately from this KIP, since
> >>>>>>> that would be a larger scope of discussion. I agree that if we are
> >>>>>>> going to maintain only one level of metrics, it should be the lowest
> >>>>>>> level, and we would let users do the roll-ups themselves. But I'm
> >>>>>>> still not fully convinced that we should provide just single-level
> >>>>>>> metrics, because 1) for different metrics people may want to
> >>>>>>> investigate at different granularities, e.g. for poll / commit rate
> >>>>>>> these are at the lowest task-level metrics, while for process-rate /
> >>>>>>> skip-rate they can be as low as processor-node metrics, and 2)
> >>>>>>> user-side roll-ups may not be very straightforward. But for 2), if
> >>>>>>> someone can provide an efficient and easy implementation of that, I
> >>>>>>> can be persuaded :)
> >>>>>>>
> >>>>>>> For now I'm thinking we can add the metric at thread level. Either
> >>>>>>> finer-grained metrics with a "reason" tag plus an aggregated one
> >>>>>>> without the tag, or just a single aggregated metric without the tag,
> >>>>>>> looks good to me.
> >>>>>>>
> >>>>>>>
> >>>>>>> Guozhang
> >>>>>>>
> >>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey Guozhang,
> >>>>>>>>
> >>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a
> >>>>>> list in
> >>>>>>>> the KIP. I was also thinking we'd better start a documentation
> page
> >>>>>> with
> >>>>>>>> the metrics listed.
> >>>>>>>>
> >>>>>>>> I'd have no problem logging a warning when we skip records. On the
> >>>>>> metric
> >>>>>>>> front, really I'm just pushing for us to maintain only one level
> of
> >>>>>>>> metrics. If that's more or less granular (i.e., maybe we don't
> >> have a
> >>>>>>>> metric per reason and log the reason instead), that's fine by me.
> I
> >>>>>> just
> >>>>>>>> don't think it provides a lot of extra value per complexity
> >>> (interface
> >>>>>>> and
> >>>>>>>> implementation) to maintain roll-ups at the thread level in
> >> addition
> >>>>>> to
> >>>>>>>> lower-level metrics.
> >>>>>>>>
> >>>>>>>> How about this instead:
> >>>>>>>> * maintain one skipped-record metric (could be per-thread,
> >> per-task,
> >>>>>> or
> >>>>>>>> per-processor-node) with no "reason"
> >>>>>>>> * introduce a warn-level log detailing the topic/partition/offset
> >> and
> >>>>>>>> reason of the skipped record
> >>>>>>>>
> >>>>>>>> If you like that, I can update the KIP.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -John
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
> wangg...@gmail.com
> >>>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>> One thing you mention is the notion of setting alerts on coarser
> >>>>>>>> metrics
> >>>>>>>>> being easier than finer ones. All the metric alerting systems I
> >> have
> >>>>>>> used
> >>>>>>>>> make it equally easy to alert on metrics by-tag or over tags. So
> >> my
> >>>>>>>>> experience doesn't say that this is a use case. Were you thinking
> >>>>>> of an
> >>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> >>>>>>>>>
> >>>>>>>>> For the commonly used JMX reporter tags will be encoded directly
> >> as
> >>>>>>> part
> >>>>>>>> of
> >>>>>>>>> the object name, and if users want to monitor them they need to
> >>>>>> know
> >>>>>>>> these
> >>>>>>>>> values beforehand. That is also why I think we do want to list
> >> all
> >>>>>> the
> >>>>>>>>> possible values of the reason tags in the KIP, since
> >>>>>>>>>
> >>>>>>>>>> In my email in response to Matthias, I gave an example of the
> >>>>>> kind of
> >>>>>>>>> scenario that would lead me as an operator to run with DEBUG on
> >> all
> >>>>>> the
> >>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> once,
> >>>>>> that
> >>>>>>>> it
> >>>>>>>>> would ever happen again. The solution is to capture all the
> >>>>>> available
> >>>>>>>>> information about the reason and location of skips all the time.
> >>>>>>>>>
> >>>>>>>>> That is a good point. I think we can either expose metrics at all
> >>>>>>>>> levels by default, or expose only the lowest-level metrics and get
> >>>>>>>>> rid of the other levels to let users do roll-ups themselves (which
> >>>>>>>>> will be a much larger scope for discussion), or we can encourage
> >>>>>>>>> users not to depend purely on metrics for such troubleshooting:
> >>>>>>>>> that is to say, users are only alerted based on metrics, and we log
> >>>>>>>>> an info / warn log4j entry each time we are about to skip a record,
> >>>>>>>>> wherever that happens, so that upon being notified users can look
> >>>>>>>>> into the logs to find the details on where / when it happens. WDYT?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Guozhang
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io
> >
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the review.
> >>>>>>>>>>
> >>>>>>>>>> 1.
> >>>>>>>>>> Matthias raised the same question about the "reason" tag values.
> >> I
> >>>>>>> can
> >>>>>>>>> list
> >>>>>>>>>> all possible values of the "reason" tag, but I'm thinking this
> >>>>>> level
> >>>>>>> of
> >>>>>>>>>> detail may not be KIP-worthy, maybe the code and documentation
> >>>>>> review
> >>>>>>>>> would
> >>>>>>>>>> be sufficient. If you all disagree and would like it included in
> >>>>>> the
> >>>>>>>>> KIP, I
> >>>>>>>>>> can certainly do that.
> >>>>>>>>>>
> >>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of
> >>>>>> keeping
> >>>>>>>> the
> >>>>>>>>>> same name but eliminating the tags for the dimensions that were
> >>>>>>>>> rolled-up.
> >>>>>>>>>>
> >>>>>>>>>> 2.
> >>>>>>>>>> I'm not too sure that implementation efficiency really becomes a
> >>>>>>> factor
> >>>>>>>>> in
> >>>>>>>>>> choosing whether to (by default) update one coarse metric at the
> >>>>>>> thread
> >>>>>>>>>> level or one granular metric at the processor-node level, since
> >>>>>> it's
> >>>>>>>> just
> >>>>>>>>>> one metric being updated either way. I do agree that if we were
> >> to
> >>>>>>>> update
> >>>>>>>>>> the granular metrics and multiple roll-ups, then we should
> >>>>>> consider
> >>>>>>> the
> >>>>>>>>>> efficiency.
> >>>>>>>>>>
> >>>>>>>>>> I agree it's probably not necessary to surface the metrics for
> >> all
> >>>>>>>> nodes
> >>>>>>>>>> regardless of whether they can or do skip records. Perhaps we
> can
> >>>>>>>> lazily
> >>>>>>>>>> register the metrics.
> >>>>>>>>>>
> >>>>>>>>>> In my email in response to Matthias, I gave an example of the
> >>>>>> kind of
> >>>>>>>>>> scenario that would lead me as an operator to run with DEBUG on
> >>>>>> all
> >>>>>>> the
> >>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> >> once,
> >>>>>>> that
> >>>>>>>>> it
> >>>>>>>>>> would ever happen again. The solution is to capture all the
> >>>>>> available
> >>>>>>>>>> information about the reason and location of skips all the time.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> One thing you mention is the notion of setting alerts on coarser
> >>>>>>>> metrics
> >>>>>>>>>> being easier than finer ones. All the metric alerting systems I
> >>>>>> have
> >>>>>>>> used
> >>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags. So
> >>>>>> my
> >>>>>>>>>> experience doesn't say that this is a use case. Were you
> thinking
> >>>>>> of
> >>>>>>> an
> >>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> >>>>>>>>>>
> >>>>>>>>>> Thanks again,
> >>>>>>>>>> -John
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> >>>>>> wangg...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hello John,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the KIP. Some comments:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Could you list all the possible values of the "reason" tag?
> >>>>>>>>>>> In the JIRA ticket I left some potential reasons, but I'm not
> >>>>>>>>>>> clear whether you're going to categorize each of them as a
> >>>>>>>>>>> separate reason, or whether you have any additional ones in mind.
> >>>>>>>>>>>
> >>>>>>>>>>> Also I'm wondering if we should add another metric that does not
> >>>>>>>>>>> have the reason tag but aggregates across all possible reasons?
> >>>>>>>>>>> This is so that users can easily set up alerting in their
> >>>>>>>>>>> monitoring systems (otherwise they have to write one notification
> >>>>>>>>>>> rule per reason).
> >>>>>>>>>>>
> >>>>>>>>>>> 2. Note that the processor-node metrics are actually
> >>>>>>>>>>> "per-thread, per-task, per-processor-node", and today we only
> >>>>>>>>>>> set the per-thread metrics as INFO while leaving the lower two
> >>>>>>>>>>> layers as DEBUG. I agree with your argument that we are missing
> >>>>>>>>>>> the per-client roll-up metrics today, but I'm not convinced that
> >>>>>>>>>>> the right way to approach it would be
> >>>>>>>>>>> "just-providing-the-lowest-level metrics only".
> >>>>>>>>>>>
> >>>>>>>>>>> Note that the recording implementation of these three levels is
> >>>>>>>>>>> different internally today: we do not roll up the lower-level
> >>>>>>>>>>> metrics to generate the higher-level ones, but record them
> >>>>>>>>>>> separately, which means that, if we turn on multiple levels of
> >>>>>>>>>>> metrics, we may collect some metrics twice. One can argue that
> >>>>>>>>>>> is not the best way to represent multi-level metrics collecting
> >>>>>>>>>>> and reporting, but by only enabling thread-level metrics as INFO
> >>>>>>>>>>> today, that implementation could be more efficient than only
> >>>>>>>>>>> collecting the metrics at the lowest level and then doing the
> >>>>>>>>>>> roll-up calculations outside of the metrics classes.
> >>>>>>>>>>>
> >>>>>>>>>>> Plus, today not all processor nodes can skip records: AFAIK we
> >>>>>>>>>>> will only skip records at the source, sink, window and
> >>>>>>>>>>> aggregation processor nodes, so adding a metric per processor
> >>>>>>>>>>> looks like overkill to me as well. On the other hand, from the
> >>>>>>>>>>> user's perspective the "reason" tag may be sufficient to narrow
> >>>>>>>>>>> down where inside the topology records are being dropped on the
> >>>>>>>>>>> floor. So I think the "per-thread, per-task" level metrics
> >>>>>>>>>>> should be sufficient for troubleshooting in DEBUG mode, and we
> >>>>>>>>>>> can add another "per-thread" level metric as INFO, which is
> >>>>>>>>>>> turned on by default. So under normal execution users still only
> >>>>>>>>>>> need INFO level metrics for alerting (e.g. set alerts on all
> >>>>>>>>>>> skipped-records metrics being non-zero), and then when
> >>>>>>>>>>> troubleshooting they can turn on DEBUG metrics to look into
> >>>>>>>>>>> which task is actually causing the skipped records.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Guozhang
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> >>>>>>>>> matth...@confluent.io>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the KIP John.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Reading the material on the related Jiras, I am wondering what
> >>>>>>>>> `reason`
> >>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP should
> >>>>>>> list
> >>>>>>>>>> those
> >>>>>>>>>>>> IMHO.
> >>>>>>>>>>>>
> >>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say that
> >>>>>>>>>>>>
> >>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> >>>>>>> simultaneously
> >>>>>>>>>>>>
> >>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm
> >>>>>>> that
> >>>>>>>>> we
> >>>>>>>>>>>> should put the fine grained metrics into INFO level and remove
> >>>>>>> the
> >>>>>>>>>>>> roll-up at thread level.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Given that they have to do this sum to get a usable
> >>>>>> top-level
> >>>>>>>> view
> >>>>>>>>>>>>
> >>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
> >>>>>>> Offering
> >>>>>>>> a
> >>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box might
> >>>>>> be
> >>>>>>> a
> >>>>>>>>>>>> better solution. In the past we did not offer this due to
> >>>>>>>> performance
> >>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
> >>>>>>>> disagree,
> >>>>>>>>>> can
> >>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
> >>>>>>>>> alternatives"
> >>>>>>>>>>>> section.
> >>>>>>>>>>>>
> >>>>>>>>>>>> To rephrase: I understand the issue about the missing top-level
> >>>>>>>>>>>> view, but instead of going more fine-grained, we should consider
> >>>>>>>>>>>> adding this top-level view and adding/keeping the fine-grained
> >>>>>>>>>>>> metrics at DEBUG level only.
> >>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old
> >>>>>>>> metrics
> >>>>>>>>>>>> directly as you suggested.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> >>>>>>>>>>>>> Looks good to me.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> >>>>>>> j...@confluent.io
> >>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hello all,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> >>>>>> skipped
> >>>>>>>>> records
> >>>>>>>>>>> in
> >>>>>>>>>>>>>> Streams.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please find the details here:
> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please let me know what you think!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> -- Guozhang
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> -- Guozhang
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> >
>
>
