John, sorry for my late reply and thanks for updating the KIP.
I like your approach about "metrics are for monitoring, logs are for debugging" -- however:

1) I don't see the connection between this and the task-level metrics that you propose in order to get the metrics into `TopologyTestDriver`. I don't think people would monitor the `TopologyTestDriver`, and thus I am wondering why it is important to include the metrics there? Thread-level metrics might be easier to monitor though (ie, fewer distinct metrics to monitor).

2) I am a little worried that WARN level logging might be too chatty -- as you pointed out, it's about debugging, thus DEBUG level might be better. Not 100% sure about this, to be honest. What is the general assumption about the frequency of skipped records? I could imagine cases for which skipped records are quite frequent, and thus WARN level logs might "flood" the logs.

One final remark:

> More generally, I would like to establish a pattern in which we could add new values for the "reason" tags without needing a KIP to do so.

From my understanding, this is not feasible. Changing metrics is always considered a public API change, and we need a KIP for any change. As we moved away from tagging, it doesn't matter for the KIP anymore -- just wanted to point it out.

-Matthias

On 3/30/18 2:47 PM, John Roesler wrote:
> Allrighty! The KIP is updated.
>
> Thanks again, all, for the feedback.
> -John
>
> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io> wrote:
>
>> Hey Guozhang and Bill,
>>
>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like to put it in the task instead of the thread so that it'll show up in the TopologyTestDriver metrics as well.
>>
>> I'm leaning toward keeping the scope where it is right now, but if others want to advocate for tossing in some more metrics, we can go that route.
>>
>> Thanks all,
>> -John
>>
>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com> wrote:
>>
>>> Thanks for the KIP John, and sorry for the late comments.
>>>
>>> I'm on the fence about providing single-level metrics, but I think we'll have that discussion outside of this KIP.
>>>
>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>
>>> I'm +1 on both of these suggestions.
>>>
>>> Finally, we have had requests in the past for some metrics around when a persistent store removes an expired window. Would adding that to our metrics stretch the scope of this KIP too much?
>>>
>>> Thanks again, and overall I'm +1 on this KIP.
>>>
>>> Bill
>>>
>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>
>>>> The proposal sounds good to me. As for "maintain only one level of metrics", maybe we can discuss that separately from this KIP, since it would be a larger scope of discussion. I agree that if we are going to maintain only one level of metrics it should be the lowest level, and we would let users do the roll-ups themselves, but I'm still not fully convinced that we should just provide single-level metrics, because 1) I think for different metrics people may be interested in investigating different granularities, e.g. poll / commit rate are at the lowest task-level metrics, while process-rate / skip-rate can be as low as processor-node metrics, and 2) user-side roll-ups may not be very straightforward. But for 2), if someone can provide an efficient and easy implementation of that, I can be persuaded :)
>>>>
>>>> For now I'm thinking we can add the metric at the thread level; either finer-grained ones with a "reason" tag plus an aggregated one without the tag, or just a single aggregated metric without the tag, looks good to me.
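Guozhang's point 2 above -- that user-side roll-ups may not be straightforward -- can be illustrated with a minimal sketch. This assumes metric values are available keyed by their tag maps; the tag names `task-id` and `reason` are placeholders for illustration, not the KIP's final names:

```java
import java.util.HashMap;
import java.util.Map;

public class SkipRollup {
    /**
     * Roll finer-grained skip metrics up along one tag dimension.
     * Input: metric values keyed by their tag maps
     * (e.g. {task-id=0_0, reason=deserialization-error} -> 3.0).
     * Output: totals keyed by the chosen tag's value, summed over all other tags.
     */
    public static Map<String, Double> rollUpBy(Map<Map<String, String>, Double> metrics, String tag) {
        Map<String, Double> totals = new HashMap<>();
        for (Map.Entry<Map<String, String>, Double> e : metrics.entrySet()) {
            totals.merge(e.getKey().getOrDefault(tag, "unknown"), e.getValue(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<Map<String, String>, Double> skips = Map.of(
            Map.of("task-id", "0_0", "reason", "deserialization-error"), 3.0,
            Map.of("task-id", "0_0", "reason", "negative-timestamp"), 1.0,
            Map.of("task-id", "0_1", "reason", "deserialization-error"), 2.0);
        System.out.println(rollUpBy(skips, "task-id"));  // per-task totals
        System.out.println(rollUpBy(skips, "reason"));   // per-reason totals
    }
}
```

The mechanics are simple per dimension; the friction Guozhang alludes to is more about users having to write and maintain such glue code at all.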
>>>>
>>>>
>>>> Guozhang
>>>>
>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:
>>>>
>>>>> Hey Guozhang,
>>>>>
>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list in the KIP. I was also thinking we'd better start a documentation page with the metrics listed.
>>>>>
>>>>> I'd have no problem logging a warning when we skip records. On the metric front, really I'm just pushing for us to maintain only one level of metrics. If that's more or less granular (i.e., maybe we don't have a metric per reason and log the reason instead), that's fine by me. I just don't think it provides a lot of extra value per complexity (interface and implementation) to maintain roll-ups at the thread level in addition to lower-level metrics.
>>>>>
>>>>> How about this instead:
>>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>>>
>>>>> If you like that, I can update the KIP.
>>>>>
>>>>> Thanks,
>>>>> -John
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>
>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>
>>>>>> For the commonly used JMX reporter, tags will be encoded directly as part of the object name, and if users want to monitor them they need to know these values beforehand.
>>>>>> That is also why I think we do want to list all the possible values of the reason tags in the KIP, since
>>>>>>
>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>
>>>>>> That is a good point. I think we can either expose all levels of metrics by default, or only expose the lowest-level metrics and get rid of the other levels to let users do roll-ups themselves (which would be a much larger scope for discussion), or we can encourage users not to depend purely on metrics for such troubleshooting: that is to say, users would only be alerted based on metrics, and we can log an info / warn log4j entry each time we are about to skip a record, in all the relevant places, so that upon being notified users can look into the logs to find the details on where / when it happens. WDYT?
>>>>>>
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>
>>>>>>> Hey Guozhang,
>>>>>>>
>>>>>>> Thanks for the review.
>>>>>>>
>>>>>>> 1.
>>>>>>> Matthias raised the same question about the "reason" tag values. I can list all possible values of the "reason" tag, but I'm thinking this level of detail may not be KIP-worthy; maybe the code and documentation review would be sufficient. If you all disagree and would like it included in the KIP, I can certainly do that.
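Guozhang's JMX point -- tags are encoded directly into the MBean object name, so a user must know every tag value up front to subscribe to a metric -- can be seen with a small stdlib sketch. The domain, type, and tag names below are assumptions in the style of Kafka's JMX naming, not the exact names from the KIP:

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class JmxTagDemo {
    // Hypothetical object-name builder: with the JMX reporter, each tag
    // becomes a key property of the ObjectName, so the full name is only
    // known once every tag value (including "reason") is known.
    public static ObjectName skipMetricName(String threadId, String taskId, String reason) {
        try {
            return new ObjectName(String.format(
                "kafka.streams:type=stream-task-metrics,thread-id=%s,task-id=%s,reason=%s",
                threadId, taskId, reason));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        ObjectName name = skipMetricName("StreamThread-1", "0_0", "deserialization-error");
        System.out.println(name);
        System.out.println("reason = " + name.getKeyProperty("reason"));
    }
}
```

Monitoring systems that understand ObjectName patterns (e.g. `reason=*,*`) can match over unknown tag values, but plain JMX subscriptions cannot, which is the motivation for listing the possible "reason" values in the KIP.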
>>>>>>>
>>>>>>> If we do provide roll-up metrics, I agree with the pattern of keeping the same name but eliminating the tags for the dimensions that were rolled up.
>>>>>>>
>>>>>>> 2.
>>>>>>> I'm not too sure that implementation efficiency really becomes a factor in choosing whether to (by default) update one coarse metric at the thread level or one granular metric at the processor-node level, since it's just one metric being updated either way. I do agree that if we were to update the granular metrics and multiple roll-ups, then we should consider the efficiency.
>>>>>>>
>>>>>>> I agree it's probably not necessary to surface the metrics for all nodes regardless of whether they can or do skip records. Perhaps we can lazily register the metrics.
>>>>>>>
>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>>
>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>>
>>>>>>> Thanks again,
>>>>>>> -John
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello John,
>>>>>>>>
>>>>>>>> Thanks for the KIP.
>>>>>>>> Some comments:
>>>>>>>>
>>>>>>>> 1. Could you list all the possible values of the "reason" tag? In the JIRA ticket I left some potential reasons, but I'm not clear whether you're going to categorize each of them as a separate reason, or whether there are additional ones you have in mind.
>>>>>>>>
>>>>>>>> Also, I'm wondering if we should add another metric that does not have the reason tag but aggregates among all possible reasons? This is for users to easily set their alerting notifications (otherwise they have to write one notification rule per reason) in their monitoring systems.
>>>>>>>>
>>>>>>>> 2. Note that the processor-node metrics are actually "per-thread, per-task, per-processor-node", and today we only set the per-thread metrics as INFO while leaving the lower two layers as DEBUG. I agree with your argument that we are missing the per-client roll-up metrics today, but I'm not convinced that the right way to approach it would be "just providing the lowest-level metrics only".
>>>>>>>>
>>>>>>>> Note that the recording implementation of these three levels is different internally today: we do not just roll up the lower-level metrics to generate the higher-level ones; we record them separately, which means that, if we turn on multiple levels of metrics, we may duplicate collecting some metrics.
>>>>>>>> One can argue that is not the best way to represent multi-level metrics collecting and reporting, but by only enabling thread-level metrics as INFO today, that implementation could be more efficient than collecting the metrics only at the lowest level and then doing the roll-up calculations outside of the metrics classes.
>>>>>>>>
>>>>>>>> Plus, today not all processor nodes may possibly skip records; AFAIK we will only skip records at the source, sink, window, and aggregation processor nodes, so adding a metric per processor looks like overkill to me as well. On the other hand, from the user's perspective the "reason" tag may be sufficient to narrow down where inside the topology records are being dropped on the floor. So I think the "per-thread, per-task" level metrics should be sufficient for troubleshooting in DEBUG mode, and we can add another "per-thread" level metric as INFO which is turned on by default. So under normal execution users still only need INFO level metrics for alerting (e.g. set alerts on all skipped-records metrics being non-zero), and then upon troubleshooting they can turn on DEBUG metrics to look into which task is actually causing the skipped records.
>>>>>>>>
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>
>>>>>>>>> Reading the material on the related Jiras, I am wondering what `reason` tags you want to introduce? Can you elaborate? The KIP should list those IMHO.
>>>>>>>>>
>>>>>>>>> About the fine-grained metrics vs. the roll-up: you say that
>>>>>>>>>
>>>>>>>>>> the coarse metric aggregates across two dimensions simultaneously
>>>>>>>>>
>>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm that we should put the fine-grained metrics into INFO level and remove the roll-up at thread level.
>>>>>>>>>
>>>>>>>>>> Given that they have to do this sum to get a usable top-level view
>>>>>>>>>
>>>>>>>>> This is a fair concern, but I don't share the conclusion. Offering a built-in `KafkaStreams` "client" roll-up out of the box might be a better solution. In the past we did not offer this due to performance concerns, but we could allow an "opt-in" mechanism. If you disagree, can you provide some reasoning and add them to the "Rejected alternatives" section?
>>>>>>>>>
>>>>>>>>> To rephrase: I understand the issue about the missing top-level view, but instead of going more fine-grained, we should consider adding this top-level view and adding/keeping the fine-grained metrics at DEBUG level only.
>>>>>>>>>
>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old metrics directly as you suggested.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>> Looks good to me.
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around skipped records in Streams.
>>>>>>>>>>>
>>>>>>>>>>> Please find the details here:
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>
>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -- Guozhang
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> -- Guozhang
>>>>>
>>>>
>>>> --
>>>> -- Guozhang
>>>
>>
>
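For illustration, the warn-level log discussed in this thread -- one line carrying topic, partition, offset, and reason, alongside a single skip metric with no "reason" tag -- might look like the following sketch. The helper and the message format are hypothetical, not the wording that the KIP or Kafka ultimately adopted:

```java
import java.util.logging.Logger;

public class SkipLogDemo {
    private static final Logger log = Logger.getLogger(SkipLogDemo.class.getName());

    // Hypothetical helper: one log line carrying everything needed to debug
    // a skipped record, so the metric itself can stay a single unlabeled count.
    public static String skipMessage(String topic, int partition, long offset, String reason) {
        return String.format(
            "Skipping record due to %s. topic=[%s] partition=[%d] offset=[%d]",
            reason, topic, partition, offset);
    }

    public static void main(String[] args) {
        // Alert on the metric; read the log for the details.
        log.warning(skipMessage("orders", 2, 1042L, "deserialization error"));
    }
}
```

This captures the division of labor agreed in the thread: the metric answers "are records being skipped?", while the log answers "where and why?".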