The proposal sounds good to me. About "maintain only one level of metrics": maybe we can discuss that separately from this KIP, since it is a larger discussion. I agree that if we are going to maintain only one level of metrics it should be the lowest level, and we would let users do the roll-ups themselves. But I'm still not fully convinced that we should provide only single-level metrics, because 1) for different metrics people may be interested in investigating at different granularities, e.g. poll / commit rate bottom out at the task level, while process-rate / skip-rate can be as low as the processor-node level, and 2) user-side roll-ups may not be very straightforward. For 2), though, if someone can provide an efficient and easy implementation of that I can be persuaded :)
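For concreteness, a user-side roll-up might look something like the sketch below. It assumes the skip metrics are reachable through KafkaStreams#metrics(), and the metric name "skipped-records-rate" is a placeholder rather than a name settled by this KIP:

    import java.util.Map;

    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.KafkaStreams;

    public final class SkippedRecordsRollup {

        // Sums one metric across all of its tagged instances (threads,
        // tasks, processor nodes), i.e. the roll-up users would have to
        // do themselves if only the lowest-level metrics were exposed.
        // The metric name passed in is a placeholder, not a name
        // confirmed by the KIP.
        public static double total(final KafkaStreams streams, final String metricName) {
            double sum = 0.0;
            for (final Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
                if (entry.getKey().name().equals(metricName)) {
                    final Object value = entry.getValue().metricValue();
                    if (value instanceof Number) {
                        sum += ((Number) value).doubleValue();
                    }
                }
            }
            return sum;
        }
    }

A user would then alert on total(streams, "skipped-records-rate") being non-zero. The catch is that every user has to know the right metric name and tag layout, which is part of the roll-up question being debated here.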
For now I'm thinking we can add the metric at the thread level, either as finer-grained metrics with a "reason" tag plus an aggregated one without the tag, or as just a single aggregated metric without the tag; either looks good to me.

Guozhang

On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:

> Hey Guozhang,
>
> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list
> in the KIP. I was also thinking we'd better start a documentation page
> with the metrics listed.
>
> I'd have no problem logging a warning when we skip records. On the
> metric front, really I'm just pushing for us to maintain only one level
> of metrics. Whether that's more or less granular (e.g., maybe we don't
> have a metric per reason and log the reason instead) is fine by me. I
> just don't think maintaining roll-ups at the thread level in addition
> to lower-level metrics provides a lot of extra value relative to the
> complexity (interface and implementation).
>
> How about this instead:
> * maintain one skipped-record metric (could be per-thread, per-task, or
>   per-processor-node) with no "reason"
> * introduce a warn-level log detailing the topic/partition/offset and
>   reason of the skipped record
>
> If you like that, I can update the KIP.
>
> Thanks,
> -John
>
> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > > One thing you mention is the notion of setting alerts on coarser
> > > metrics being easier than finer ones. All the metric alerting
> > > systems I have used make it equally easy to alert on metrics by-tag
> > > or over tags. So my experience doesn't say that this is a use case.
> > > Were you thinking of an alerting system that makes such a
> > > pre-aggregation valuable?
> >
> > For the commonly used JMX reporter, tags will be encoded directly as
> > part of the object name, and if users want to monitor them they need
> > to know these values beforehand. That is also why I think we do want
> > to list all the possible values of the "reason" tag in the KIP.
> >
> > > In my email in response to Matthias, I gave an example of the kind
> > > of scenario that would lead me as an operator to run with DEBUG on
> > > all the time, since I wouldn't be sure, having seen a skipped record
> > > once, that it would ever happen again. The solution is to capture
> > > all the available information about the reason and location of skips
> > > all the time.
> >
> > That is a good point. I think we can either expose all levels of
> > metrics by default; or expose only the lowest-level metrics and get
> > rid of the other levels, letting users do the roll-ups themselves
> > (which would be a much larger scope of discussion); or encourage users
> > not to depend purely on metrics for such troubleshooting: that is,
> > users would only be alerted based on metrics, and we would log an
> > info / warn log4j entry everywhere we are about to skip a record, so
> > that upon being notified users can look into the logs to find the
> > details on where / when it happened. WDYT?
> >
> > Guozhang
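To make the JMX reporter point concrete: with JmxReporter, each tag becomes a key property of the MBean ObjectName, so a monitoring client either knows the exact tag values up front or queries with a wildcard pattern. A minimal sketch, with the domain, type, tag, and attribute names as illustrative stand-ins rather than the final names from the KIP:

    import java.lang.management.ManagementFactory;
    import java.util.Set;

    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public final class SkippedRecordsJmxQuery {

        public static void main(final String[] args) throws Exception {
            final MBeanServer server = ManagementFactory.getPlatformMBeanServer();

            // Tags become key properties of the ObjectName, e.g.
            // kafka.streams:type=stream-metrics,client-id=my-app-1.
            // A "reason" tag would appear the same way, so a client that
            // does not know the reason values ahead of time needs a
            // wildcard. All names here are illustrative placeholders.
            final ObjectName pattern =
                new ObjectName("kafka.streams:type=stream-metrics,client-id=*");

            final Set<ObjectName> matches = server.queryNames(pattern, null);
            for (final ObjectName name : matches) {
                // The attribute name stands in for the eventual metric name.
                final Object rate = server.getAttribute(name, "skipped-records-rate");
                System.out.println(name + " -> " + rate);
            }
        }
    }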
> > On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io> wrote:
> >
> > > Hey Guozhang,
> > >
> > > Thanks for the review.
> > >
> > > 1.
> > > Matthias raised the same question about the "reason" tag values. I
> > > can list all possible values of the "reason" tag, but I'm thinking
> > > this level of detail may not be KIP-worthy; maybe the code and
> > > documentation review would be sufficient. If you all disagree and
> > > would like it included in the KIP, I can certainly do that.
> > >
> > > If we do provide roll-up metrics, I agree with the pattern of
> > > keeping the same name but eliminating the tags for the dimensions
> > > that were rolled up.
> > >
> > > 2.
> > > I'm not too sure that implementation efficiency really becomes a
> > > factor in choosing whether to (by default) update one coarse metric
> > > at the thread level or one granular metric at the processor-node
> > > level, since it's just one metric being updated either way. I do
> > > agree that if we were to update the granular metrics and multiple
> > > roll-ups, then we should consider the efficiency.
> > >
> > > I agree it's probably not necessary to surface the metrics for all
> > > nodes regardless of whether they can or do skip records. Perhaps we
> > > can lazily register the metrics.
> > >
> > > In my email in response to Matthias, I gave an example of the kind
> > > of scenario that would lead me as an operator to run with DEBUG on
> > > all the time, since I wouldn't be sure, having seen a skipped record
> > > once, that it would ever happen again. The solution is to capture
> > > all the available information about the reason and location of skips
> > > all the time.
> > >
> > > One thing you mention is the notion of setting alerts on coarser
> > > metrics being easier than finer ones. All the metric alerting
> > > systems I have used make it equally easy to alert on metrics by-tag
> > > or over tags. So my experience doesn't say that this is a use case.
> > > Were you thinking of an alerting system that makes such a
> > > pre-aggregation valuable?
> > >
> > > Thanks again,
> > > -John
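On the efficiency point, it may help to note that Kafka's metrics library already supports multi-level recording through parent sensors: a single record() call on a child sensor also updates its parents. The sketch below shows the mechanism in isolation; it is not how Streams wires its sensors today, and all names are illustrative:

    import java.util.Collections;

    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Rate;

    public final class ParentSensorSketch {

        public static void main(final String[] args) {
            final Metrics metrics = new Metrics();

            // Thread-level sensor: the coarse roll-up.
            final Sensor threadSkips = metrics.sensor("thread.skipped-records");
            threadSkips.add(
                new MetricName("skipped-records-rate", "stream-metrics",
                    "illustrative metric name",
                    Collections.singletonMap("thread-id", "thread-1")),
                new Rate());

            // Task-level sensor declared with the thread-level sensor as
            // its parent: one record() on the child updates both levels.
            final Sensor taskSkips = metrics.sensor("task.skipped-records", threadSkips);
            taskSkips.add(
                new MetricName("skipped-records-rate", "stream-task-metrics",
                    "illustrative metric name",
                    Collections.singletonMap("task-id", "0_0")),
                new Rate());

            taskSkips.record(); // counts toward both the task- and thread-level rates
            metrics.close();
        }
    }

This is the kind of machinery that would let a thread-level roll-up coexist with finer metrics without a second record() call per skipped record.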
> > > On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > >
> > > > Hello John,
> > > >
> > > > Thanks for the KIP. Some comments:
> > > >
> > > > 1. Could you list all the possible values of the "reason" tag? In
> > > > the JIRA ticket I left some potential reasons, but I'm not clear
> > > > whether you're going to categorize each of them as a separate
> > > > reason, or whether there are additional ones you have in mind.
> > > >
> > > > Also, I'm wondering if we should add another metric that does not
> > > > have the reason tag but aggregates across all possible reasons.
> > > > This is for users to easily set their alerting notifications in
> > > > their monitoring systems (otherwise they have to write one
> > > > notification rule per reason).
> > > >
> > > > 2. Note that the processor-node metrics are actually "per-thread,
> > > > per-task, per-processor-node", and today we only set the
> > > > per-thread metrics as INFO while leaving the lower two layers as
> > > > DEBUG. I agree with your argument that we are missing the
> > > > per-client roll-up metrics today, but I'm not convinced that the
> > > > right way to approach it would be "just providing the lowest-level
> > > > metrics only".
> > > >
> > > > Note that the recording implementations of these three levels are
> > > > different internally today: we did not just roll up the
> > > > lower-level metrics to generate the higher-level ones; we record
> > > > them separately, which means that if we turn on multiple levels of
> > > > metrics, we may collect some metrics twice. One can argue that
> > > > this is not the best way to represent multi-level metrics
> > > > collection and reporting, but by enabling only thread-level
> > > > metrics as INFO today, that implementation could be more efficient
> > > > than collecting the metrics only at the lowest level and then
> > > > doing the roll-up calculations outside of the metrics classes.
> > > >
> > > > Plus, today not all processor nodes can skip records; AFAIK we
> > > > will only skip records at the source, sink, window, and
> > > > aggregation processor nodes, so adding a metric per processor
> > > > looks like overkill to me as well. On the other hand, from the
> > > > user's perspective the "reason" tag may be sufficient to narrow
> > > > down where inside the topology records are being dropped on the
> > > > floor. So I think the "per-thread, per-task" level metrics should
> > > > be sufficient for troubleshooting in DEBUG mode, and we can add
> > > > another "per-thread" level metric as INFO which is turned on by
> > > > default. Under normal execution users then only need INFO-level
> > > > metrics for alerting (e.g. set alerts on all skipped-records
> > > > metrics being non-zero), and when troubleshooting they can turn on
> > > > DEBUG metrics to look into which task is actually causing the
> > > > skipped records.
> > > >
> > > > Guozhang
> > > >
> > > > On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
> > > >
> > > > > Thanks for the KIP John.
> > > > >
> > > > > Reading the material on the related Jiras, I am wondering what
> > > > > `reason` tags you want to introduce? Can you elaborate? The KIP
> > > > > should list those IMHO.
> > > > >
> > > > > About the fine-grained metrics vs the roll-up: you say that
> > > > >
> > > > > > the coarse metric aggregates across two dimensions
> > > > > > simultaneously
> > > > >
> > > > > Can you elaborate on why this is an issue? I am not convinced
> > > > > atm that we should put the fine-grained metrics into INFO level
> > > > > and remove the roll-up at the thread level.
> > > > >
> > > > > > Given that they have to do this sum to get a usable top-level
> > > > > > view
> > > > >
> > > > > This is a fair concern, but I don't share the conclusion.
> > > > > Offering a built-in `KafkaStreams` "client" roll-up out of the
> > > > > box might be a better solution. In the past we did not offer
> > > > > this due to performance concerns, but we could allow an "opt-in"
> > > > > mechanism. If you disagree, can you provide some reasoning and
> > > > > add it to the "Rejected alternatives" section?
> > > > >
> > > > > To rephrase: I understand the issue about the missing top-level
> > > > > view, but instead of going more fine grained, we should consider
> > > > > adding this top-level view and keeping the fine-grained metrics
> > > > > at DEBUG level only.
> > > > >
> > > > > I am +1 on adding TopologyTestDriver#metrics() and on removing
> > > > > the old metrics directly, as you suggested.
> > > > >
> > > > > -Matthias
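Assuming TopologyTestDriver#metrics() lands as Matthias describes, a test could read the skip metrics directly from the driver. In this sketch the topology and configs are minimal placeholders, and "skipped-records-total" is an assumed metric name, not one confirmed by the KIP:

    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.TopologyTestDriver;

    public final class SkippedRecordsMetricsCheck {

        public static void main(final String[] args) {
            // A trivial pass-through topology, just for illustration.
            final StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input").to("output");
            final Topology topology = builder.build();

            final Properties config = new Properties();
            config.put(StreamsConfig.APPLICATION_ID_CONFIG, "skip-metrics-check");
            config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

            final TopologyTestDriver driver = new TopologyTestDriver(topology, config);
            try {
                // ... pipe test records through the driver here ...

                // Look up the proposed skip metric; the name is a placeholder.
                for (final Map.Entry<MetricName, ? extends Metric> entry : driver.metrics().entrySet()) {
                    if (entry.getKey().name().equals("skipped-records-total")) {
                        System.out.println(entry.getKey() + " = " + entry.getValue().metricValue());
                    }
                }
            } finally {
                driver.close();
            }
        }
    }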
> > > > > On 3/28/18 6:42 PM, Ted Yu wrote:
> > > > >
> > > > > > Looks good to me.
> > > > > >
> > > > > > On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > I am proposing KIP-274 to improve the metrics around skipped
> > > > > > > records in Streams.
> > > > > > >
> > > > > > > Please find the details here:
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
> > > > > > >
> > > > > > > Please let me know what you think!
> > > > > > >
> > > > > > > Thanks,
> > > > > > > -John
> > > >
> > > > --
> > > > -- Guozhang
> >
> > --
> > -- Guozhang

--
-- Guozhang