Thanks for the KIP John, and sorry for the late comments. I'm on the fence
about providing single-level metrics, but I think we'll have that discussion
outside of this KIP.
> * maintain one skipped-record metric (could be per-thread, per-task, or
> per-processor-node) with no "reason"
> * introduce a warn-level log detailing the topic/partition/offset and
> reason of the skipped record

I'm +1 on both of these suggestions.

Finally, we have had requests in the past for some metrics around when a
persistent store removes an expired window. Would adding that to our
metrics stretch the scope of this KIP too much?

Thanks again, and overall I'm +1 on this KIP.

Bill

On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> The proposal sounds good to me. About "maintain only one level of
> metrics": maybe we can discuss that separately from this KIP, since it
> would be a larger discussion. I agree that if we are going to maintain
> only one level of metrics it should be the lowest level, and we would let
> users do the roll-ups themselves, but I'm still not fully convinced that
> we should provide just single-level metrics, because 1) for different
> metrics, people may be interested in investigating different
> granularities, e.g. poll / commit rate are at the lowest, task-level
> metrics, while process-rate / skip-rate can be as low as processor-node
> metrics, and 2) user-side roll-ups may not be very straightforward. But
> for 2), if someone can provide an efficient and easy implementation of
> that, I can be persuaded :)
>
> For now I'm thinking we can add the metric at the thread level, either as
> finer-grained ones with a "reason" tag plus an aggregated one without the
> tag, or as just a single aggregated metric without the tag; either looks
> good to me.
>
>
> Guozhang
>
> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:
>
> > Hey Guozhang,
> >
> > Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list
> > in the KIP. I was also thinking we'd better start a documentation page
> > with the metrics listed.
> >
> > I'd have no problem logging a warning when we skip records. On the
> > metric front, really I'm just pushing for us to maintain only one level
> > of metrics. If that's more or less granular (i.e., maybe we don't have
> > a metric per reason and log the reason instead), that's fine by me. I
> > just don't think maintaining roll-ups at the thread level in addition
> > to lower-level metrics provides enough extra value relative to the
> > complexity (interface and implementation).
> >
> > How about this instead:
> > * maintain one skipped-record metric (could be per-thread, per-task, or
> > per-processor-node) with no "reason"
> > * introduce a warn-level log detailing the topic/partition/offset and
> > reason of the skipped record
> >
> > If you like that, I can update the KIP.
> >
> > Thanks,
> > -John
> >
> >
> >
> > On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com>
> > wrote:
> >
> > > > One thing you mention is the notion of setting alerts on coarser
> > > > metrics being easier than on finer ones. All the metric alerting
> > > > systems I have used make it equally easy to alert on metrics
> > > > by-tag or over tags. So my experience doesn't say that this is a
> > > > use case. Were you thinking of an alerting system that makes such
> > > > a pre-aggregation valuable?
> > >
> > > For the commonly used JMX reporter, tags will be encoded directly as
> > > part of the object name, and if users want to monitor them they need
> > > to know these values beforehand.
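As an illustration of the JMX point above: with the default JMX reporter,
every tag value is baked into the MBean object name, so a monitor either has
to know the values up front or wildcard over them. Below is a minimal Java
sketch along those lines, assuming the 1.1-era thread-level object-name
layout and an aggregate "skipped-records-rate" attribute; the exact names
are illustrative, not quoted from the KIP.

    import java.lang.management.ManagementFactory;

    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class SkippedRecordsJmxProbe {

        public static void main(final String[] args) throws Exception {
            // Runs inside the same JVM as the Streams application, since it
            // queries the platform MBean server directly.
            final MBeanServer server = ManagementFactory.getPlatformMBeanServer();

            // The tag value (client-id, i.e. the thread id) is part of the
            // object name, so we wildcard over it rather than enumerating
            // the values up front.
            final ObjectName pattern =
                new ObjectName("kafka.streams:type=stream-metrics,client-id=*");

            for (final ObjectName name : server.queryNames(pattern, null)) {
                final Object rate = server.getAttribute(name, "skipped-records-rate");
                System.out.println(name + " -> skipped-records-rate = " + rate);
            }
        }
    }

This prints one line per stream thread; the wildcard is what spares the
monitor from knowing each thread id beforehand.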
> > > That is also why I think we do want to list all the possible values
> > > of the reason tags in the KIP, since users would need to know these
> > > values up front to monitor them.
> > >
> > > > In my email in response to Matthias, I gave an example of the kind
> > > > of scenario that would lead me as an operator to run with DEBUG on
> > > > all the time, since I wouldn't be sure, having seen a skipped
> > > > record once, that it would ever happen again. The solution is to
> > > > capture all the available information about the reason and
> > > > location of skips all the time.
> > >
> > > That is a good point. I think we can either expose metrics at all
> > > levels by default, or expose only the lowest-level metrics and get
> > > rid of the other levels to let users do the roll-ups themselves
> > > (which would be a much larger discussion), or we can encourage users
> > > not to depend purely on metrics for such troubleshooting: that is to
> > > say, users would only be alerted based on metrics, and we would log
> > > an info / warn log4j entry each time we are about to skip a record,
> > > in all the relevant places, so that upon being notified users can
> > > look into the logs to find the details on where / when it happens.
> > > WDYT?
> > >
> > >
> > > Guozhang
> > >
> > >
> > >
> > > On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io>
> > > wrote:
> > >
> > > > Hey Guozhang,
> > > >
> > > > Thanks for the review.
> > > >
> > > > 1.
> > > > Matthias raised the same question about the "reason" tag values. I
> > > > can list all possible values of the "reason" tag, but I'm thinking
> > > > this level of detail may not be KIP-worthy; maybe the code and
> > > > documentation review would be sufficient. If you all disagree and
> > > > would like it included in the KIP, I can certainly do that.
> > > >
> > > > If we do provide roll-up metrics, I agree with the pattern of
> > > > keeping the same name but eliminating the tags for the dimensions
> > > > that were rolled up.
> > > >
> > > > 2.
> > > > I'm not too sure that implementation efficiency really becomes a
> > > > factor in choosing whether to (by default) update one coarse
> > > > metric at the thread level or one granular metric at the
> > > > processor-node level, since it's just one metric being updated
> > > > either way. I do agree that if we were to update the granular
> > > > metrics and multiple roll-ups, then we should consider the
> > > > efficiency.
> > > >
> > > > I agree it's probably not necessary to surface the metrics for all
> > > > nodes regardless of whether they can or do skip records. Perhaps
> > > > we can lazily register the metrics.
> > > >
> > > > In my email in response to Matthias, I gave an example of the kind
> > > > of scenario that would lead me as an operator to run with DEBUG on
> > > > all the time, since I wouldn't be sure, having seen a skipped
> > > > record once, that it would ever happen again. The solution is to
> > > > capture all the available information about the reason and
> > > > location of skips all the time.
> > > >
> > > > One thing you mention is the notion of setting alerts on coarser
> > > > metrics being easier than on finer ones. All the metric alerting
> > > > systems I have used make it equally easy to alert on metrics
> > > > by-tag or over tags. So my experience doesn't say that this is a
> > > > use case. Were you thinking of an alerting system that makes such
> > > > a pre-aggregation valuable?
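To make the warn-level skip log proposed in this thread concrete, a skip
site could look roughly like the sketch below. The skip condition, the
message layout, and the commented-out sensor name are assumptions for
illustration, not the actual Streams internals.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class SkipLoggingSketch {

        private static final Logger log = LoggerFactory.getLogger(SkipLoggingSketch.class);

        // Hypothetical skip site: a record whose extracted timestamp is
        // negative cannot be processed, so the details go to the log at WARN
        // and a single un-tagged skipped-records metric would be bumped.
        void maybeSkip(final ConsumerRecord<byte[], byte[]> record, final long timestamp) {
            if (timestamp < 0) {
                log.warn(
                    "Skipping record due to negative extracted timestamp. "
                        + "topic=[{}] partition=[{}] offset=[{}] extractedTimestamp=[{}]",
                    record.topic(), record.partition(), record.offset(), timestamp);
                // skippedRecordsSensor.record();  // one metric, no "reason"
                // tag; the reason lives only in the log line above
            }
        }
    }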
> > > > Thanks again,
> > > > -John
> > > >
> > > >
> > > > On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello John,
> > > > >
> > > > > Thanks for the KIP. Some comments:
> > > > >
> > > > > 1. Could you list all the possible values of the "reason" tag?
> > > > > In the JIRA ticket I left some potential reasons, but I'm not
> > > > > clear whether you're going to categorize each of them as a
> > > > > separate reason, or whether there are any additional ones you
> > > > > have in mind.
> > > > >
> > > > > Also I'm wondering if we should add another metric that does not
> > > > > have the reason tag but aggregates across all possible reasons.
> > > > > This is for users to easily set their alerting notifications
> > > > > (otherwise they have to write one notification rule per reason)
> > > > > in their monitoring systems.
> > > > >
> > > > > 2. Note that the processor-node metrics are actually
> > > > > "per-thread, per-task, per-processor-node", and today we only
> > > > > set the per-thread metrics as INFO while leaving the lower two
> > > > > layers as DEBUG. I agree with your argument that we are missing
> > > > > the per-client roll-up metrics today, but I'm not convinced that
> > > > > the right way to approach it would be "just providing the
> > > > > lowest-level metrics only".
> > > > >
> > > > > Note that the recording implementation of these three levels is
> > > > > different internally today: we do not just roll up the
> > > > > lower-level metrics to generate the higher-level ones; we record
> > > > > them separately, which means that if we turn on multiple levels
> > > > > of metrics, we may duplicate collection of some metrics. One can
> > > > > argue that this is not the best way to represent multi-level
> > > > > metrics collection and reporting, but by only enabling
> > > > > thread-level metrics as INFO today, that implementation could be
> > > > > more efficient than collecting the metrics only at the lowest
> > > > > level and then doing the roll-up calculations outside of the
> > > > > metrics classes.
> > > > >
> > > > > Plus, today not all processor nodes can possibly skip records:
> > > > > AFAIK we will only skip records at the source, sink, window, and
> > > > > aggregation processor nodes, so adding a metric per processor
> > > > > looks like overkill to me as well. On the other hand, from the
> > > > > user's perspective the "reason" tag may be sufficient to narrow
> > > > > down where inside the topology records are being dropped on the
> > > > > floor. So I think the "per-thread, per-task" level metrics
> > > > > should be sufficient for troubleshooting in DEBUG mode, and we
> > > > > can add another "per-thread" level metric as INFO, which is
> > > > > turned on by default. So under normal execution users still only
> > > > > need INFO-level metrics for alerting (e.g. set alerts on all
> > > > > skipped-records metrics being non-zero), and then when
> > > > > troubleshooting they can turn on DEBUG metrics to look into
> > > > > which task is actually causing the skipped records.
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > > On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax
> > > > > <matth...@confluent.io> wrote:
> > > > >
> > > > > > Thanks for the KIP John.
> > > > > > Reading the material on the related Jiras, I am wondering
> > > > > > what `reason` tags you want to introduce? Can you elaborate?
> > > > > > The KIP should list those IMHO.
> > > > > >
> > > > > > About the fine-grained metrics vs. the roll-up: you say that
> > > > > >
> > > > > > > the coarse metric aggregates across two dimensions
> > > > > > > simultaneously
> > > > > >
> > > > > > Can you elaborate on why this is an issue? I am not convinced
> > > > > > atm that we should put the fine-grained metrics into INFO
> > > > > > level and remove the roll-up at thread level.
> > > > > >
> > > > > > > Given that they have to do this sum to get a usable
> > > > > > > top-level view
> > > > > >
> > > > > > This is a fair concern, but I don't share the conclusion.
> > > > > > Offering a built-in `KafkaStreams` "client" roll-up out of the
> > > > > > box might be a better solution. In the past we did not offer
> > > > > > this due to performance concerns, but we could allow an
> > > > > > "opt-in" mechanism. If you disagree, can you provide some
> > > > > > reasoning and add it to the "Rejected alternatives" section?
> > > > > >
> > > > > > To rephrase: I understand the issue about the missing
> > > > > > top-level view, but instead of going more fine-grained, we
> > > > > > should consider adding this top-level view and adding/keeping
> > > > > > the fine-grained metrics at DEBUG level only.
> > > > > >
> > > > > > I am +1 on adding TopologyTestDriver#metrics() and removing
> > > > > > the old metrics directly, as you suggested.
> > > > > >
> > > > > >
> > > > > > -Matthias
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 3/28/18 6:42 PM, Ted Yu wrote:
> > > > > > > Looks good to me.
> > > > > > >
> > > > > > > On Wed, Mar 28, 2018 at 3:11 PM, John Roesler
> > > > > > > <j...@confluent.io> wrote:
> > > > > > >
> > > > > > >> Hello all,
> > > > > > >>
> > > > > > >> I am proposing KIP-274 to improve the metrics around
> > > > > > >> skipped records in Streams.
> > > > > > >>
> > > > > > >> Please find the details here:
> > > > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
> > > > > > >>
> > > > > > >> Please let me know what you think!
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> -John
> > > > > > >>
> > > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> >
>
> --
> -- Guozhang
>
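For reference, here is a sketch of how the TopologyTestDriver#metrics()
accessor discussed above might be used, assuming it mirrors the shape of
KafkaStreams#metrics() as the KIP proposes. The topology, configs, and
metric-name filter are invented for the example.

    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.TopologyTestDriver;

    public class SkippedRecordsMetricsExample {

        public static void main(final String[] args) {
            // Trivial pass-through topology, just to have something to drive.
            final StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic").to("output-topic");

            final Properties config = new Properties();
            config.put("application.id", "kip-274-metrics-example");
            config.put("bootstrap.servers", "localhost:9092"); // never contacted by the test driver

            final TopologyTestDriver driver = new TopologyTestDriver(builder.build(), config);
            try {
                // The proposed accessor: same shape as KafkaStreams#metrics().
                final Map<MetricName, ? extends Metric> metrics = driver.metrics();
                metrics.forEach((name, metric) -> {
                    if (name.name().startsWith("skipped-records")) {
                        System.out.println(name + " = " + metric.metricValue());
                    }
                });
            } finally {
                driver.close();
            }
        }
    }

With no records piped in, the skipped-records values would simply read zero;
the point is only that the test driver exposes the same metrics map the
runtime does.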