I think Matthias' comment is that we can still record the metrics at the
thread level, while having the WARN log entry include sufficient context
information so that users can still easily narrow down the investigation
scope.
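As a purely hypothetical illustration of that idea, a context-rich WARN entry could carry the reason together with the processor node, topic, partition, and offset. The format and names below are a sketch, not the actual Kafka Streams log format:

```java
// Hypothetical formatter for a skipped-record WARN entry. The fields
// (reason, node, topic, partition, offset) mirror the context discussed
// in this thread; the wording is illustrative only.
public class SkipLogSketch {
    static String skipMessage(String reason, String node,
                              String topic, int partition, long offset) {
        return String.format(
            "Skipping record due to %s. node=[%s] topic=[%s] partition=[%d] offset=[%d]",
            reason, node, topic, partition, offset);
    }

    public static void main(String[] args) {
        System.out.println(skipMessage("negative extracted timestamp",
                "KSTREAM-SOURCE-0000000000", "input-topic", 2, 1042L));
    }
}
```

With that much context in a single line, a user alerted by a thread-level metric can grep the logs and go straight to the offending partition.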


Guozhang

On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io> wrote:

> I agree we should add as much information as is reasonable to the log. For
> example, see this WIP PR I started for this KIP:
>
> https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80
>
> and
>
> https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111
>
> I'm not sure if we should nail down the log messages in the KIP or in the
> PR discussion. What say you?
>
> Thanks,
> -John
>
> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matth...@confluent.io>
> wrote:
>
> > Thanks for sharing your thoughts. As I mentioned originally, I am not
> > sure about the right log level either. Your arguments are convincing --
> > thus, I am fine with keeping WARN level.
> >
> > The task vs thread level argument is an interesting one. However, I am
> > wondering if we should add this information to the corresponding WARN
> > logs that we write anyway? In this case, we can also log the
> > corresponding operator (and other information like the topic name etc.
> > if needed). WDYT about this?
> >
> >
> > -Matthias
> >
> > On 4/2/18 8:31 PM, Guozhang Wang wrote:
> > > Regarding logging: I'm inclined to keep logging at WARN level, since
> > > skipped records are not expected in normal execution (for all reasons
> > > that we are aware of), and hence when errors happen users should be
> > > alerted by the metrics and look into the log files; so to me, if it
> > > really spams the log files, that is itself a good alert for users.
> > > Besides, for deserialization errors we already log at WARN level for
> > > this reason.
> > >
> > > Regarding the metrics levels: I was pondering on that as well. What
> > > made me think about and agree on task-level rather than thread-level
> > > is that for some reasons, like window retention, skips may happen on
> > > only a subset of input partitions; since tasks are correlated with
> > > partitions, the task-level metrics can help users narrow down to the
> > > specific input data partitions.
> > >
> > >
> > > Guozhang
> > >
> > >
> > > On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io>
> > > wrote:
> > >
> > >> Hi Matthias,
> > >>
> > >> No worries! Thanks for the reply.
> > >>
> > >> 1) There isn't a connection. I tried using the TopologyTestDriver
> > >> to write a quick test exercising the current behavior and discovered
> > >> that the metrics weren't available. It seemed like they should be,
> > >> so I tacked it on to this KIP. If you feel it's inappropriate, I can
> > >> pull it back out.
> > >>
> > >> 2) I was also concerned about that, but I figured it would come up
> > >> in discussion if I just went ahead and proposed it. And here we are!
> > >>
> > >> Here's my thought: maybe there are two classes of skips: "controlled"
> > >> and "uncontrolled", where "controlled" means, as an app author, I
> > >> deliberately filter out some events, and "uncontrolled" means that I
> > >> simply don't account for some feature of the data, and the framework
> > >> skips them (as opposed to crashing).
> > >>
> > >> In this breakdown, the skips I'm adding metrics for are all
> > >> uncontrolled skips (and we hope to measure all the uncontrolled
> > >> skips). Our skips are well documented, so it wouldn't be terrible to
> > >> have an application in which you know you expect to have tons of
> > >> uncontrolled skips, but it's not great either, since you may also
> > >> have some *unexpected* uncontrolled skips. Those will be difficult
> > >> to notice, since you're probably not alerting on the metric and are
> > >> filtering out the logs (whatever their level).
> > >>
> > >> I'd recommend any app author, as an alternative, to convert all
> > >> expected skips to controlled ones, by updating the topology to
> > >> filter those records out.
> > >>
> > >> Following from my recommendation, as a library author, I'm inclined
> > >> to mark those logs WARN, since in my opinion they should be
> > >> concerning to the app authors. I'd definitely want to show, rather
> > >> than hide, them by default, so I would pick INFO at least.
> > >>
> > >> That said, logging is always a tricky issue for lower-level
> > >> libraries that run inside user code, since we don't have all the
> > >> information we need to make the right call.
> > >>
> > >>
> > >>
> > >> On your last note, yeah, I got that impression from Guozhang as well.
> > >> Thanks for the clarification.
> > >>
> > >> -John
> > >>
> > >>
> > >>
> > >> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io>
> > >> wrote:
> > >>
> > >>> John,
> > >>>
> > >>> sorry for my late reply and thanks for updating the KIP.
> > >>>
> > >>> I like your approach about "metrics are for monitoring, logs are for
> > >>> debugging" -- however:
> > >>>
> > >>> 1) I don't see a connection between this and the task-level metrics
> > >>> that you propose to get the metrics in `TopologyTestDriver`. I don't
> > >>> think people would monitor the `TopologyTestDriver`, and thus I am
> > >>> wondering why it is important to include the metrics there?
> > >>> Thread-level metrics might be easier to monitor though (i.e., fewer
> > >>> distinct metrics to monitor).
> > >>>
> > >>> 2) I am a little worried about WARN level logging and that it might
> > >>> be too chatty -- as you pointed out, it's about debugging, thus
> > >>> DEBUG level might be better. Not 100% sure about this to be honest.
> > >>> What is the general assumption about the frequency of skipped
> > >>> records? I could imagine cases for which skipped records are quite
> > >>> frequent and thus, WARN level logs might "flood" the logs.
> > >>>
> > >>> One final remark:
> > >>>
> > >>>> More generally, I would like to establish a pattern in which we
> > >>>> could add new values for the "reason" tags without needing a KIP
> > >>>> to do so.
> > >>>
> > >>> From my understanding, this is not feasible. Changing metrics is
> > >>> always considered a public API change, and we need a KIP for any
> > >>> change. As we moved away from tagging, it doesn't matter for the
> > >>> KIP anymore -- just wanted to point it out.
> > >>>
> > >>>
> > >>> -Matthias
> > >>>
> > >>>
> > >>> On 3/30/18 2:47 PM, John Roesler wrote:
> > >>>> Allrighty! The KIP is updated.
> > >>>>
> > >>>> Thanks again, all, for the feedback.
> > >>>> -John
> > >>>>
> > >>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io>
> > >>>> wrote:
> > >>>>
> > >>>>> Hey Guozhang and Bill,
> > >>>>>
> > >>>>> Ok, I'll update the KIP. At the risk of disturbing consensus,
> > >>>>> I'd like to put it in the task instead of the thread so that
> > >>>>> it'll show up in the TopologyTestDriver metrics as well.
> > >>>>>
> > >>>>> I'm leaning toward keeping the scope where it is right now, but
> > >>>>> if others want to advocate for tossing in some more metrics, we
> > >>>>> can go that route.
> > >>>>>
> > >>>>> Thanks all,
> > >>>>> -John
> > >>>>>
> > >>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Thanks for the KIP John, and sorry for the late comments.
> > >>>>>>
> > >>>>>> I'm on the fence about providing a single level of metrics, but
> > >>>>>> I think we'll have that discussion outside of this KIP.
> > >>>>>>
> > >>>>>>> * maintain one skipped-record metric (could be per-thread,
> > >>>>>>> per-task, or per-processor-node) with no "reason"
> > >>>>>>> * introduce a warn-level log detailing the
> > >>>>>>> topic/partition/offset and reason of the skipped record
> > >>>>>>
> > >>>>>> I'm +1 on both of these suggestions.
> > >>>>>>
> > >>>>>> Finally, we have had requests in the past for some metrics
> > >>>>>> around when a persistent store removes an expired window. Would
> > >>>>>> adding that to our metrics stretch the scope of this KIP too
> > >>>>>> much?
> > >>>>>>
> > >>>>>> Thanks again and overall I'm +1 on this KIP
> > >>>>>>
> > >>>>>> Bill
> > >>>>>>
> > >>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> The proposal sounds good to me. About "maintain only one level
> > >>>>>>> of metrics": maybe we can discuss that separately from this
> > >>>>>>> KIP, since that would be a larger scope of discussion. I agree
> > >>>>>>> that if we are going to maintain only one level of metrics it
> > >>>>>>> should be the lowest level and we would let users do the
> > >>>>>>> roll-ups themselves, but I'm still not fully convinced that we
> > >>>>>>> should just provide single-level metrics, because 1) I think
> > >>>>>>> for different metrics people may be interested to investigate
> > >>>>>>> into different granularities, e.g. poll / commit rate are at
> > >>>>>>> the lowest task-level metrics, while process-rate / skip-rate
> > >>>>>>> can be as low as processor-node metrics, and 2) user-side
> > >>>>>>> roll-ups may not be very straightforward. But for 2), if
> > >>>>>>> someone can provide an efficient and easy implementation of
> > >>>>>>> that I can be persuaded :)
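For what it's worth, the user-side roll-up mentioned here could be as simple as summing the finer-grained counts. This toy sketch collapses hypothetical per-(task, reason) skip counts into one thread-level total; it uses a plain map, not the actual Kafka `Metrics` machinery, and the task/reason names are made up:

```java
import java.util.Map;

// Toy roll-up: per-(task, reason) skip counts summed into a single
// thread-level total -- the aggregation users would otherwise get from
// a built-in thread-level metric.
public class RollupSketch {
    static long threadLevelSkips(Map<String, Long> taskReasonCounts) {
        return taskReasonCounts.values().stream()
                               .mapToLong(Long::longValue)
                               .sum();
    }

    public static void main(String[] args) {
        Map<String, Long> counts = Map.of(
                "task-0_0/deserialization-error", 3L,
                "task-0_1/negative-timestamp", 2L,
                "task-1_0/expired-window", 5L);
        System.out.println(threadLevelSkips(counts)); // 10
    }
}
```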
> > >>>>>>>
> > >>>>>>> For now I'm thinking we can add the metric at the thread
> > >>>>>>> level; either the finer-grained ones with a "reason" tag plus
> > >>>>>>> an aggregated one without the tag, or just a single aggregated
> > >>>>>>> metric without the tag, looks good to me.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Guozhang
> > >>>>>>>
> > >>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hey Guozhang,
> > >>>>>>>>
> > >>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
> > >>>>>>>> provide a list in the KIP. I was also thinking we'd better
> > >>>>>>>> start a documentation page with the metrics listed.
> > >>>>>>>>
> > >>>>>>>> I'd have no problem logging a warning when we skip records. On
> > >>>>>>>> the metric front, really I'm just pushing for us to maintain
> > >>>>>>>> only one level of metrics. Whether that's more or less granular
> > >>>>>>>> (i.e., maybe we don't have a metric per reason and log the
> > >>>>>>>> reason instead) is fine by me. I just don't think it provides
> > >>>>>>>> a lot of extra value per complexity (interface and
> > >>>>>>>> implementation) to maintain roll-ups at the thread level in
> > >>>>>>>> addition to lower-level metrics.
> > >>>>>>>>
> > >>>>>>>> How about this instead:
> > >>>>>>>> * maintain one skipped-record metric (could be per-thread,
> > >>>>>>>> per-task, or per-processor-node) with no "reason"
> > >>>>>>>> * introduce a warn-level log detailing the
> > >>>>>>>> topic/partition/offset and reason of the skipped record
> > >>>>>>>>
> > >>>>>>>> If you like that, I can update the KIP.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> -John
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>>> One thing you mention is the notion of setting alerts on
> > >>>>>>>>>> coarser metrics being easier than finer ones. All the metric
> > >>>>>>>>>> alerting systems I have used make it equally easy to alert
> > >>>>>>>>>> on metrics by-tag or over tags. So my experience doesn't say
> > >>>>>>>>>> that this is a use case. Were you thinking of an alerting
> > >>>>>>>>>> system that makes such a pre-aggregation valuable?
> > >>>>>>>>>
> > >>>>>>>>> For the commonly used JMX reporter, tags will be encoded
> > >>>>>>>>> directly as part of the object name, and if users want to
> > >>>>>>>>> monitor them they need to know these values beforehand. That
> > >>>>>>>>> is also why I think we do want to list all the possible
> > >>>>>>>>> values of the "reason" tag in the KIP.
> > >>>>>>>>>
> > >>>>>>>>>> In my email in response to Matthias, I gave an example of
> > >>>>>>>>>> the kind of scenario that would lead me as an operator to
> > >>>>>>>>>> run with DEBUG on all the time, since I wouldn't be sure,
> > >>>>>>>>>> having seen a skipped record once, that it would ever happen
> > >>>>>>>>>> again. The solution is to capture all the available
> > >>>>>>>>>> information about the reason and location of skips all the
> > >>>>>>>>>> time.
> > >>>>>>>>>
> > >>>>>>>>> That is a good point. I think we can either expose metrics
> > >>>>>>>>> at all levels by default, or only expose the lowest-level
> > >>>>>>>>> metrics and get rid of the other levels to let users do the
> > >>>>>>>>> roll-ups themselves (which will be a much larger scope for
> > >>>>>>>>> discussion), or we can encourage users to not purely depend
> > >>>>>>>>> on metrics for such troubleshooting: that is to say, users
> > >>>>>>>>> would only be alerted based on metrics, and we can log an
> > >>>>>>>>> INFO / WARN log4j entry each time we are about to skip a
> > >>>>>>>>> record, all over the places, so that upon being notified
> > >>>>>>>>> users can look into the logs to find the details on where /
> > >>>>>>>>> when it happens. WDYT?
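To make the JMX point above concrete: with the standard JMX reporter, each tag becomes a key property of the MBean ObjectName, so an alerting rule has to enumerate (or wildcard) every possible tag value up front. The domain and tag names below are illustrative, not the exact names Streams registers:

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Illustrative: how a "reason" tag would be baked into a JMX object
// name, which is why all possible tag values need to be known up front.
public class JmxNameSketch {
    public static void main(String[] args) throws MalformedObjectNameException {
        ObjectName name = new ObjectName(
                "kafka.streams:type=stream-task-metrics,"
                + "task-id=0_1,reason=deserialization-error");
        // A per-reason alerting rule must name each value like this one:
        System.out.println(name.getKeyProperty("reason"));
    }
}
```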
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Guozhang
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hey Guozhang,
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for the review.
> > >>>>>>>>>>
> > >>>>>>>>>> 1.
> > >>>>>>>>>> Matthias raised the same question about the "reason" tag
> > >>>>>>>>>> values. I can list all possible values of the "reason" tag,
> > >>>>>>>>>> but I'm thinking this level of detail may not be KIP-worthy;
> > >>>>>>>>>> maybe the code and documentation review would be sufficient.
> > >>>>>>>>>> If you all disagree and would like it included in the KIP,
> > >>>>>>>>>> I can certainly do that.
> > >>>>>>>>>>
> > >>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern
> > >>>>>>>>>> of keeping the same name but eliminating the tags for the
> > >>>>>>>>>> dimensions that were rolled up.
> > >>>>>>>>>>
> > >>>>>>>>>> 2.
> > >>>>>>>>>> I'm not too sure that implementation efficiency really
> > >>>>>>>>>> becomes a factor in choosing whether to (by default) update
> > >>>>>>>>>> one coarse metric at the thread level or one granular metric
> > >>>>>>>>>> at the processor-node level, since it's just one metric
> > >>>>>>>>>> being updated either way. I do agree that if we were to
> > >>>>>>>>>> update the granular metrics and multiple roll-ups, then we
> > >>>>>>>>>> should consider the efficiency.
> > >>>>>>>>>>
> > >>>>>>>>>> I agree it's probably not necessary to surface the metrics
> > >>>>>>>>>> for all nodes regardless of whether they can or do skip
> > >>>>>>>>>> records. Perhaps we can lazily register the metrics.
> > >>>>>>>>>>
> > >>>>>>>>>> In my email in response to Matthias, I gave an example of
> > >>>>>>>>>> the kind of scenario that would lead me as an operator to
> > >>>>>>>>>> run with DEBUG on all the time, since I wouldn't be sure,
> > >>>>>>>>>> having seen a skipped record once, that it would ever happen
> > >>>>>>>>>> again. The solution is to capture all the available
> > >>>>>>>>>> information about the reason and location of skips all the
> > >>>>>>>>>> time.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> One thing you mention is the notion of setting alerts on
> > >>>>>>>>>> coarser metrics being easier than finer ones. All the metric
> > >>>>>>>>>> alerting systems I have used make it equally easy to alert
> > >>>>>>>>>> on metrics by-tag or over tags. So my experience doesn't say
> > >>>>>>>>>> that this is a use case. Were you thinking of an alerting
> > >>>>>>>>>> system that makes such a pre-aggregation valuable?
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks again,
> > >>>>>>>>>> -John
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hello John,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for the KIP. Some comments:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. Could you list all the possible values of the "reason"
> > >>>>>>>>>>> tag? In the JIRA ticket I left some potential reasons but
> > >>>>>>>>>>> I'm not clear whether you're going to categorize each of
> > >>>>>>>>>>> them as a separate reason, or whether there are additional
> > >>>>>>>>>>> ones you have in mind.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Also I'm wondering if we should add another metric that
> > >>>>>>>>>>> does not have the reason tag but aggregates across all
> > >>>>>>>>>>> possible reasons? This is for users to easily set their
> > >>>>>>>>>>> alerting notifications (otherwise they have to write one
> > >>>>>>>>>>> notification rule per reason) in their monitoring systems.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2. Note that the processor-node metrics are actually
> > >>>>>>>>>>> "per-thread, per-task, per-processor-node", and today we
> > >>>>>>>>>>> only set the per-thread metrics as INFO while leaving the
> > >>>>>>>>>>> lower two layers as DEBUG. I agree with your argument that
> > >>>>>>>>>>> we are missing the per-client roll-up metrics today, but
> > >>>>>>>>>>> I'm not convinced that the right way to approach it would
> > >>>>>>>>>>> be "just providing the lowest-level metrics only".
> > >>>>>>>>>>>
> > >>>>>>>>>>> Note the recording implementations of these three levels
> > >>>>>>>>>>> are different internally today: we did not just do the
> > >>>>>>>>>>> rolling up to generate the higher-level metrics from the
> > >>>>>>>>>>> lower-level ones; we record them separately, which means
> > >>>>>>>>>>> that if we turn on multiple levels of metrics, we may
> > >>>>>>>>>>> duplicate collecting some metrics. One can argue that this
> > >>>>>>>>>>> is not the best way to represent multi-level metrics
> > >>>>>>>>>>> collecting and reporting, but by only enabling thread-level
> > >>>>>>>>>>> metrics as INFO today, that implementation could be more
> > >>>>>>>>>>> efficient than only collecting the metrics at the lowest
> > >>>>>>>>>>> level and then doing the roll-up calculations outside of
> > >>>>>>>>>>> the metrics classes.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Plus, today not all processor nodes may possibly skip
> > >>>>>>>>>>> records; AFAIK we will only skip records at the source,
> > >>>>>>>>>>> sink, window and aggregation processor nodes, so adding a
> > >>>>>>>>>>> metric per processor looks like overkill to me as well. On
> > >>>>>>>>>>> the other hand, from the user's perspective the "reason"
> > >>>>>>>>>>> tag may be sufficient for them to narrow down where inside
> > >>>>>>>>>>> the topology records are being dropped on the floor. So I
> > >>>>>>>>>>> think the "per-thread, per-task" level metrics should be
> > >>>>>>>>>>> sufficient for them to troubleshoot in DEBUG mode, and we
> > >>>>>>>>>>> can add another "per-thread" level metric as INFO which is
> > >>>>>>>>>>> turned on by default. So under normal execution users still
> > >>>>>>>>>>> only need INFO level metrics for alerting (e.g. set alerts
> > >>>>>>>>>>> on all skipped-records metrics being non-zero), and then
> > >>>>>>>>>>> upon troubleshooting they can turn on DEBUG metrics to look
> > >>>>>>>>>>> into which task is actually causing the skipped records.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Guozhang
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Thanks for the KIP John.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Reading the material on the related Jiras, I am wondering
> > >>>>>>>>>>>> what `reason` tags you want to introduce? Can you
> > >>>>>>>>>>>> elaborate? The KIP should list those IMHO.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say
> > >>>>>>>>>>>> that
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> > >>>>>>>>>>>>> simultaneously
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced
> > >>>>>>>>>>>> atm that we should put the fine grained metrics into INFO
> > >>>>>>>>>>>> level and remove the roll-up at thread level.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Given that they have to do this sum to get a usable
> > >>>>>>>>>>>>> top-level view
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
> > >>>>>>>>>>>> Offering a built-in `KafkaStreams` "client" roll-up out of
> > >>>>>>>>>>>> the box might be a better solution. In the past we did not
> > >>>>>>>>>>>> offer this due to performance concerns, but we could allow
> > >>>>>>>>>>>> an "opt-in" mechanism. If you disagree, can you provide
> > >>>>>>>>>>>> some reasoning and add it to the "Rejected alternatives"
> > >>>>>>>>>>>> section.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> To rephrase: I understand the issue about the missing
> > >>>>>>>>>>>> top-level view, but instead of going more fine grained, we
> > >>>>>>>>>>>> should consider adding this top-level view and
> > >>>>>>>>>>>> adding/keeping the fine grained metrics at DEBUG level
> > >>>>>>>>>>>> only.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove
> > >>>>>>>>>>>> the old metrics directly as you suggested.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> -Matthias
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> > >>>>>>>>>>>>> Looks good to me.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hello all,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> > >>>>>> skipped
> > >>>>>>>>> records
> > >>>>>>>>>>> in
> > >>>>>>>>>>>>>> Streams.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Please find the details here:
> > >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Please let me know what you think!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>> -John
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> -- Guozhang
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> -- Guozhang
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> -- Guozhang
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> > >
> > >
> >
> >
>



-- 
-- Guozhang
