Thanks for updating the KIP. LGTM.

-Matthias


On 4/4/18 12:46 PM, John Roesler wrote:
> Cool, if you're satisfied with the KIP now, maybe I can lobby for your vote
> ;)
> 
> The vote thread is still at only one binding +1, I think.
> 
> Thanks,
> -John
> 
> On Tue, Apr 3, 2018 at 9:05 PM, Matthias J. Sax <matth...@confluent.io>
> wrote:
> 
>> Sounds great!
>>
>> The cryptic topic names can be an issue -- however, people can
>> `describe()` their topology to map the name to the corresponding
>> sub-topology/tasks to narrow the error down to the corresponding
>> operators. I think this should be "sufficient for now" for debugging.
>>
>> Renaming those topics seems to be out-of-scope for this KIP.
>>
>>
>> -Matthias
>>
>> On 4/3/18 2:45 PM, Guozhang Wang wrote:
>>> Thanks John, your proposal looks fine to me.
>>>
>>> I'll go ahead and look into the PR for more details myself.
>>>
>>>
>>> Guozhang
>>>
>>> On Tue, Apr 3, 2018 at 1:35 PM, Bill Bejeck <bbej...@gmail.com> wrote:
>>>
>>>> Hi John,
>>>>
>>>> Thanks for making the updates.
>>>>
>>>> I agree with the information you've included in the logs as described
>>>> above, as log statements without enough context/information can be
>>>> frustrating.
>>>>
>>>> -Bill
>>>>
>>>> On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <j...@confluent.io> wrote:
>>>>
>>>>> Allrighty, how about this, then...
>>>>>
>>>>> I'll move the metric back to the StreamThread and maintain the existing
>>>> tag
>>>>> (client-id=...(per-thread client-id)). It won't be present in the
>>>>> TopologyTestDriver's metrics.
>>>>>
>>>>> As a side note, I'm not sure that the location of the log messages has
>>>>> visibility into the name of the thread or the task, or the processor
>>>> node,
>>>>> for that matter. But at the end of the day, I don't think it really
>>>>> matters.
>>>>>
>>>>> None of those identifiers are in the public interface or
>> user-controlled.
>>>>> For them to be useful for debugging, users would have to gain a very
>> deep
>>>>> understanding of how their DSL program gets executed. From my
>>>> perspective,
>>>>> they are all included in metric tags only to prevent collisions between
>>>> the
>>>>> same metrics in different (e.g.) threads.
>>>>>
>>>>> I think what's important is to provide the right information in the
>> logs
>>>>> that users will be able to debug their issues. This is why the logs in
>> my
>>>>> pr include the topic/partition/offset of the offending data, as well as
>>>> the
>>>>> stacktrace of the exception from the deserializer (or for timestamps,
>> the
>>>>> extracted timestamp and the class name of their extractor). This
>>>>> information alone should let them pinpoint the offending data and fix
>> it.
>>>>>
>>>>> (I am aware that that topic name might be a repartition topic, and
>>>>> therefore also esoteric from the user's perspective, but I think it's
>> the
>>>>> best we can do right now. It might be nice to explicitly take on a
>>>>> debugging ergonomics task in the future and give all processor nodes
>>>>> human-friendly names. Then, we could surface these names in any logs or
>>>>> exceptions. But I'm inclined to call this out-of-scope for now.)
>>>>>
>>>>> Thanks again,
>>>>> -John
>>>>>
>>>>> On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wangg...@gmail.com>
>>>> wrote:
>>>>>
>>>>>> 1. If we can indeed gather all the context information from the log4j
>>>>>> entries I'd suggest we change to thread-level (I'm not sure if that is
>>>>>> doable, so if John have already some WIP PR that can help us decide).
>>>>>>
>>>>>> 2. We can consider adding the API in TopologyTestDriver for general
>>>>> testing
>>>>>> purposes; that being said, I think Matthias has a good point that this
>>>>>> alone should not be a driving motivation for us to keep this metric as
>>>>>> task-level if 1) is true.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <
>>>> matth...@confluent.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Guozhang, that was my intent.
>>>>>>>
>>>>>>> @John: yes, we should not nail down the exact log message. It's just
>>>> to
>>>>>>> point out the trade-off. If we can get the required information in
>>>> the
>>>>>>> logs, we might not need task level metrics.
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>> On 4/3/18 11:26 AM, Guozhang Wang wrote:
>>>>>>>> I think Matthias' comment is that we can still record the metrics on
>>>>>>>> the thread-level, while having the WARN log entry include sufficient
>>>>>>>> context information so that users can still easily narrow down the
>>>>>>>> investigation scope.
>>>>>>>>
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I agree we should add as much information as is reasonable to the
>>>>> log.
>>>>>>> For
>>>>>>>>> example, see this WIP PR I started for this KIP:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111
>>>>>>>>>
>>>>>>>>> I'm not sure if we should nail down the log messages in the KIP or
>>>>> in
>>>>>>> the
>>>>>>>>> PR discussion. What say you?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> -John
>>>>>>>>>
>>>>>>>>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <
>>>>>> matth...@confluent.io
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for sharing your thoughts. As I mentioned originally, I am
>>>>> not
>>>>>>>>>> sure about the right log level either. Your arguments are
>>>>> convincing
>>>>>> --
>>>>>>>>>> thus, I am fine with keeping WARN level.
>>>>>>>>>>
>>>>>>>>>> The task vs thread level argument is an interesting one.
>>>> However, I
>>>>>> am
>>>>>>>>>> wondering if we should add this information into the
>>>> corresponding
>>>>>> WARN
>>>>>>>>>> logs that we write anyway? For this case, we can also log the
>>>>>>>>>> corresponding operator (and other information like topic name etc
>>>>> if
>>>>>>>>>> needed). WDYT about this?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Matthias
>>>>>>>>>>
>>>>>>>>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
>>>>>>>>>>> Regarding logging: I'm inclined to keep logging at WARN level
>>>>> since
>>>>>>>>>> skipped
>>>>>>>>>>> records are not expected in normal execution (for all reasons
>>>> that
>>>>>> we
>>>>>>>>> are
>>>>>>>>>>> aware of), and hence when errors happen users should be alerted
>>>>> from
>>>>>>>>>>> metrics and look into the log files, so to me if it is really
>>>>>>>>> spamming
>>>>>>>>>>> the log files it is also a good alert for users. Besides for
>>>>>>>>> deserialize
>>>>>>>>>>> errors we already log at WARN level for this reason.
>>>>>>>>>>>
>>>>>>>>>>> Regarding the metrics-levels: I was pondering on that as well. What
>>>>>>>>>>> made me think about and agree on task-level rather than
>>>>>>>>>>> thread-level is that, for reasons like window retention, skips may
>>>>>>>>>>> possibly be happening on a subset of input partitions; since tasks
>>>>>>>>>>> are correlated with partitions, the task-level metrics can help
>>>>>>>>>>> users narrow down the specific input data partitions.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Guozhang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io
>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>
>>>>>>>>>>>> No worries! Thanks for the reply.
>>>>>>>>>>>>
>>>>>>>>>>>> 1) There isn't a connection. I tried using the
>>>> TopologyTestDriver
>>>>>> to
>>>>>>>>>> write
>>>>>>>>>>>> a quick test exercising the current behavior and discovered
>>>> that
>>>>>> the
>>>>>>>>>>>> metrics weren't available. It seemed like they should be, so I
>>>>>> tacked
>>>>>>>>>> it on
>>>>>>>>>>>> to this KIP. If you feel it's inappropriate, I can pull it back
>>>>>> out.
>>>>>>>>>>>>
>>>>>>>>>>>> 2) I was also concerned about that, but I figured it would come
>>>>> up
>>>>>> in
>>>>>>>>>>>> discussion if I just went ahead and proposed it. And here we
>>>> are!
>>>>>>>>>>>>
>>>>>>>>>>>> Here's my thought: maybe there are two classes of skips:
>>>>>> "controlled"
>>>>>>>>>> and
>>>>>>>>>>>> "uncontrolled", where "controlled" means, as an app author, I
>>>>>>>>>> deliberately
>>>>>>>>>>>> filter out some events, and "uncontrolled" means that I simply
>>>>>> don't
>>>>>>>>>>>> account for some feature of the data, and the framework skips
>>>>> them
>>>>>>> (as
>>>>>>>>>>>> opposed to crashing).
>>>>>>>>>>>>
>>>>>>>>>>>> In this breakdown, the skips I'm adding metrics for are all
>>>>>>>>>> uncontrolled
>>>>>>>>>>>> skips (and we hope to measure all the uncontrolled skips). Our
>>>>>> skips
>>>>>>>>> are
>>>>>>>>>>>> well documented, so it wouldn't be terrible to have an
>>>>> application
>>>>>> in
>>>>>>>>>> which
>>>>>>>>>>>> you know you expect to have tons of uncontrolled skips, but
>>>> it's
>>>>>> not
>>>>>>>>>> great
>>>>>>>>>>>> either, since you may also have some *unexpected* uncontrolled
>>>>>> skips.
>>>>>>>>>> It'll
>>>>>>>>>>>> be difficult to notice, since you're probably not alerting on
>>>> the
>>>>>>>>> metric
>>>>>>>>>>>> and filtering out the logs (whatever their level).
>>>>>>>>>>>>
>>>>>>>>>>>> I'd recommend any app author, as an alternative, to convert all
>>>>>>>>> expected
>>>>>>>>>>>> skips to controlled ones, by updating the topology to filter
>>>>> those
>>>>>>>>>> records
>>>>>>>>>>>> out.
>>>>>>>>>>>>
>>>>>>>>>>>> Following from my recommendation, as a library author, I'm
>>>>> inclined
>>>>>>> to
>>>>>>>>>> mark
>>>>>>>>>>>> those logs WARN, since in my opinion, they should be concerning
>>>>> to
>>>>>>> the
>>>>>>>>>> app
>>>>>>>>>>>> authors. I'd definitely want to show, rather than hide, them by
>>>>>>>>>> default, so
>>>>>>>>>>>> I would pick INFO at least.
>>>>>>>>>>>>
>>>>>>>>>>>> That said, logging is always a tricky issue for lower-level
>>>>>> libraries
>>>>>>>>>> that
>>>>>>>>>>>> run inside user code, since we don't have all the information
>>>> we
>>>>>> need
>>>>>>>>> to
>>>>>>>>>>>> make the right call.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On your last note, yeah, I got that impression from Guozhang as
>>>>>> well.
>>>>>>>>>>>> Thanks for the clarification.
>>>>>>>>>>>>
>>>>>>>>>>>> -John
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
>>>>>>>>> matth...@confluent.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> John,
>>>>>>>>>>>>>
>>>>>>>>>>>>> sorry for my late reply and thanks for updating the KIP.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like your approach about "metrics are for monitoring, logs
>>>> are
>>>>>> for
>>>>>>>>>>>>> debugging" -- however:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) I don't see a connection between this and the task-level
>>>>>> metrics
>>>>>>>>>> that
>>>>>>>>>>>>> you propose to get the metrics in `TopologyTestDriver`. I don't
>>>>>>>>>>>>> think people would monitor the `TopologyTestDriver`, and thus I
>>>>>>>>>>>>> am wondering why it is important to include the metrics there.
>>>>>>>>>>>>> Thread-level metrics might be easier to monitor though (i.e.,
>>>>>>>>>>>>> fewer distinct metrics to monitor).
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) I am a little worried about WARN level logging and that it
>>>>>> might
>>>>>>>>> be
>>>>>>>>>>>>> too chatty -- as you pointed out, it's about debugging, thus
>>>>> DEBUG
>>>>>>>>>> level
>>>>>>>>>>>>> might be better. Not 100% sure about this to be honest. What
>>>> is
>>>>>> the
>>>>>>>>>>>>> general assumption about the frequency for skipped records? I
>>>>>> could
>>>>>>>>>>>>> imagine cases for which skipped records are quite frequent and
>>>>>> thus,
>>>>>>>>>>>>> WARN level logs might "flood" the logs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One final remark:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> More
>>>>>>>>>>>>>> generally, I would like to establish a pattern in which we
>>>>> could
>>>>>>> add
>>>>>>>>>>>> new
>>>>>>>>>>>>>> values for the "reason" tags without needing a KIP to do so.
>>>>>>>>>>>>>
>>>>>>>>>>>>> From my understanding, this is not feasible. Changing metrics
>>>> is
>>>>>>>>> always
>>>>>>>>>>>>> considered a public API change, and we need a KIP for any
>>>>> change.
>>>>>> As
>>>>>>>>> we
>>>>>>>>>>>>> moved away from tagging, it doesn't matter for the KIP anymore
>>>>> --
>>>>>>>>> just
>>>>>>>>>>>>> wanted to point it out.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>>>>>>>>> Allrighty! The KIP is updated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again, all, for the feedback.
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <
>>>>> j...@confluent.io
>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing
>>>> consensus,
>>>>>> I'd
>>>>>>>>>> like
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> put it in the task instead of the thread so that it'll show
>>>> up
>>>>>> in
>>>>>>>>> the
>>>>>>>>>>>>>>> TopologyTestDriver metrics as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm leaning toward keeping the scope where it is right now,
>>>>> but
>>>>>> if
>>>>>>>>>>>>> others
>>>>>>>>>>>>>>> want to advocate for tossing in some more metrics, we can go
>>>>>> that
>>>>>>>>>>>> route.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks all,
>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <
>>>>> bbej...@gmail.com
>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm on the fence with providing a single level of metrics,
>>>> but I
>>>>>>>>> think
>>>>>>>>>>>>> we'll
>>>>>>>>>>>>>>>> have that discussion outside of this KIP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>>>>>>>>> per-task,
>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>>>>>>>> * introduce a warn-level log detailing the
>>>>>>> topic/partition/offset
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Finally, we have had requests in the past for some metrics
>>>>>> around
>>>>>>>>>>>> when
>>>>>>>>>>>>>>>> persistent store removes an expired window.  Would adding
>>>>> that
>>>>>> to
>>>>>>>>>> our
>>>>>>>>>>>>>>>> metrics stretch the scope of this KIP too much?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks again and overall I'm +1 on this KIP
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
>>>>>>>>> wangg...@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The proposal sounds good to me. About "maintain only one
>>>>> level
>>>>>>> of
>>>>>>>>>>>>>>>> metrics"
>>>>>>>>>>>>>>>>> maybe we can discuss about that separately from this KIP
>>>>> since
>>>>>>>>> that
>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>> be a larger scope of discussion. I agree that if we are
>>>>> going
>>>>>> to
>>>>>>>>>>>>>>>> maintain
>>>>>>>>>>>>>>>>> only one-level metrics it should be lowest level and we
>>>>> would
>>>>>>> let
>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> do the roll-ups themselves, but I'm still not fully
>>>>> convinced
>>>>>>>>> that
>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> should just provide single-level metrics, because 1) I
>>>> think
>>>>>> for
>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>> metrics people may be interested in investigating different
>>>>>>>>>>>>>>>>> granularities, e.g. for poll / commit rate these are at the
>>>>>>>>>>>>>>>>> lowest task-level metrics, while for process-rate / skip-rate
>>>>>>>>>>>>>>>>> they can be as low as processor-node metrics, and 2)
>>>>>>>>>>>>>>>>> user-side roll-ups may not be very straightforward. But for
>>>>>>>>>>>>>>>>> 2), if someone can provide an efficient and easy
>>>>>>>>>>>>>>>>> implementation of that, I can be persuaded :)
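For what it's worth, a user-side roll-up is essentially a sum over a tag dimension. A minimal plain-Java sketch (illustrative only; this is not the Streams metrics API, and the tag names are assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RollUpSketch {
    // Sum finer-grained (e.g. per-task) metric values into one
    // coarser-grained (e.g. per-client) value by collapsing the tag.
    static double rollUp(Map<String, Double> perTaskValues) {
        double total = 0.0;
        for (double v : perTaskValues.values()) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> skippedPerTask = new LinkedHashMap<>();
        skippedPerTask.put("task-id=0_0", 3.0);
        skippedPerTask.put("task-id=0_1", 0.0);
        skippedPerTask.put("task-id=1_0", 7.0);
        System.out.println(rollUp(skippedPerTask)); // prints 10.0
    }
}
```

The sum itself is trivial; the awkward part for users is enumerating the tag values (task ids, node names) to aggregate over, which is what makes "do the roll-ups yourself" less straightforward in practice.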
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For now I'm thinking we can add the metric on
>>>> thread-level,
>>>>>>>>> either
>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> finer grained ones with "reason" tag plus an aggregated
>>>> one
>>>>>>>>> without
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> tag, or just having a single aggregated metric without the
>>>>> tag
>>>>>>>>>> looks
>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>> to me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
>>>>>>> j...@confluent.io
>>>>>>>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
>>>>>> provide
>>>>>>>>> a
>>>>>>>>>>>>>>>> list in
>>>>>>>>>>>>>>>>>> the KIP. I was also thinking we'd better start a
>>>>>> documentation
>>>>>>>>>> page
>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>> the metrics listed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'd have no problem logging a warning when we skip
>>>> records.
>>>>>> On
>>>>>>>>> the
>>>>>>>>>>>>>>>> metric
>>>>>>>>>>>>>>>>>> front, really I'm just pushing for us to maintain only
>>>> one
>>>>>>> level
>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we
>>>>>> don't
>>>>>>>>>>>> have a
>>>>>>>>>>>>>>>>>> metric per reason and log the reason instead), that's
>>>> fine
>>>>> by
>>>>>>>>> me.
>>>>>>>>>> I
>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>> don't think it provides a lot of extra value per
>>>> complexity
>>>>>>>>>>>>> (interface
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> implementation) to maintain roll-ups at the thread level
>>>> in
>>>>>>>>>>>> addition
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> lower-level metrics.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> How about this instead:
>>>>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be
>>>> per-thread,
>>>>>>>>>>>> per-task,
>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>>>>>>>>> * introduce a warn-level log detailing the
>>>>>>>>> topic/partition/offset
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
>>>>>>>>>> wangg...@gmail.com
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts
>>>> on
>>>>>>>>> coarser
>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
>>>>>> systems
>>>>>>> I
>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
>>>>>> tags.
>>>>>>>>> So
>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>>>>>>>>> thinking
>>>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation
>>>>> valuable?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For the commonly used JMX reporter, tags will be encoded
>>>>>>>>>>>>>>>>>>> directly as part of the object name, and if users want to
>>>>>>>>>>>>>>>>>>> monitor them they need to know these values beforehand.
>>>>>>>>>>>>>>>>>>> That is also why I think we do want to list all the
>>>>>>>>>>>>>>>>>>> possible values of the reason tags in the KIP.
>>>>>>>>>>>>>>>>>>>
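To illustrate the point about the JMX reporter: tags become key properties of the MBean ObjectName, which a monitoring system has to match on, so tag values (such as a "reason" tag) must be known up front. The names below are illustrative assumptions, not the exact kafka.streams MBean layout.

```java
import javax.management.ObjectName;

public class JmxTagSketch {
    public static void main(String[] args) throws Exception {
        // Tags are encoded as ObjectName key properties; a user who wants
        // to monitor a hypothetical "reason" tag must enumerate its values
        // to subscribe to each MBean (or use a property wildcard pattern).
        ObjectName name = new ObjectName(
            "kafka.streams:type=stream-task-metrics,"
                + "client-id=app-1-StreamThread-1,task-id=0_1,reason=deserialize-error");
        System.out.println(name.getKeyProperty("reason")); // prints deserialize-error
        System.out.println(name.getDomain());              // prints kafka.streams
    }
}
```

A pattern like `kafka.streams:type=stream-task-metrics,*` can match all such MBeans, but alerting per reason still requires knowing each reason value.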
>>>>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example
>>>> of
>>>>>> the
>>>>>>>>>>>>>>>> kind of
>>>>>>>>>>>>>>>>>>> scenario that would lead me as an operator to run with
>>>>> DEBUG
>>>>>>> on
>>>>>>>>>>>> all
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
>>>>> record
>>>>>>>>>> once,
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>> would ever happen again. The solution is to capture all
>>>>> the
>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>> information about the reason and location of skips all
>>>> the
>>>>>>>>> time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That is a good point. I think we can either expose all
>>>>>>>>>>>>>>>>>>> levels of metrics by default, or only expose the
>>>>>>>>>>>>>>>>>>> lowest-level metrics and get rid of the other levels to let
>>>>>>>>>>>>>>>>>>> users do roll-ups themselves (which would be a much larger
>>>>>>>>>>>>>>>>>>> scope of discussion), or we can encourage users not to
>>>>>>>>>>>>>>>>>>> depend purely on metrics for such troubleshooting: that is
>>>>>>>>>>>>>>>>>>> to say, users would only be alerted based on metrics, and
>>>>>>>>>>>>>>>>>>> we would log an info / warn log4j entry each time we are
>>>>>>>>>>>>>>>>>>> about to skip a record, all over the places, so that upon
>>>>>>>>>>>>>>>>>>> being notified users can look into the logs to find the
>>>>>>>>>>>>>>>>>>> details on where / when it happens. WDYT?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
>>>>>>>>> j...@confluent.io
>>>>>>>>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>>>>>>> Matthias raised the same question about the "reason"
>>>> tag
>>>>>>>>> values.
>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>>>> all possible values of the "reason" tag, but I'm
>>>> thinking
>>>>>>> this
>>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
>>>>>>> documentation
>>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>> be sufficient. If you all disagree and would like it
>>>>>> included
>>>>>>>>> in
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> KIP, I
>>>>>>>>>>>>>>>>>>>> can certainly do that.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the
>>>>> pattern
>>>>>> of
>>>>>>>>>>>>>>>> keeping
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> same name but eliminating the tags for the dimensions
>>>>> that
>>>>>>>>> were
>>>>>>>>>>>>>>>>>>> rolled-up.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>>>>> I'm not too sure that implementation efficiency really
>>>>>>>>> becomes a
>>>>>>>>>>>>>>>>> factor
>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> choosing whether to (by default) update one coarse
>>>> metric
>>>>>> at
>>>>>>>>> the
>>>>>>>>>>>>>>>>> thread
>>>>>>>>>>>>>>>>>>>> level or one granular metric at the processor-node
>>>> level,
>>>>>>>>> since
>>>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>>> one metric being updated either way. I do agree that if
>>>>> we
>>>>>>>>> were
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> update
>>>>>>>>>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we
>>>>> should
>>>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> efficiency.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I agree it's probably not necessary to surface the
>>>>> metrics
>>>>>>> for
>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>>>>> regardless of whether they can or do skip records.
>>>>> Perhaps
>>>>>> we
>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> lazily
>>>>>>>>>>>>>>>>>>>> register the metrics.
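Lazy registration of the skip metrics could look roughly like this. This is a plain-Java sketch, not the actual Streams sensor API; the class, method names, and node names are illustrative assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class LazyMetricsSketch {
    // Hypothetical registry: a skip counter is only created the first time
    // a given processor node actually skips a record, so nodes that never
    // skip never surface a metric.
    private final Map<String, LongAdder> skipCounters = new ConcurrentHashMap<>();

    void recordSkip(String nodeName) {
        skipCounters.computeIfAbsent(nodeName, n -> new LongAdder()).increment();
    }

    Map<String, LongAdder> registered() {
        return skipCounters;
    }

    public static void main(String[] args) {
        LazyMetricsSketch metrics = new LazyMetricsSketch();
        metrics.recordSkip("KSTREAM-SOURCE-0000000000");
        metrics.recordSkip("KSTREAM-SOURCE-0000000000");
        // Only the node that actually skipped is registered.
        System.out.println(metrics.registered().size()); // prints 1
    }
}
```

The trade-off is that a monitoring system polling a fixed list of MBeans would not see the counter until the first skip occurs, which is worth keeping in mind when deciding between lazy and eager registration.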
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example
>>>> of
>>>>>> the
>>>>>>>>>>>>>>>> kind of
>>>>>>>>>>>>>>>>>>>> scenario that would lead me as an operator to run with
>>>>>> DEBUG
>>>>>>>>> on
>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
>>>>>> record
>>>>>>>>>>>> once,
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>> would ever happen again. The solution is to capture all
>>>>> the
>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>> information about the reason and location of skips all
>>>>> the
>>>>>>>>> time.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts
>>>> on
>>>>>>>>> coarser
>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
>>>>>> systems
>>>>>>>>> I
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
>>>>>> tags.
>>>>>>>>> So
>>>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were
>>>> you
>>>>>>>>>> thinking
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation
>>>>> valuable?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
>>>>>>>>>>>>>>>> wangg...@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. Could you list all the possible values of the
>>>>> "reason"
>>>>>>>>> tag?
>>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> JIRA
>>>>>>>>>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear
>>>>> if
>>>>>>>>>> you're
>>>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> categorize each of them as a separate reason, or whether
>>>>>>>>>>>>>>>>>>>>> there are any additional ones you have in mind.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Also I'm wondering if we should add another metric that
>>>>>>>>>>>>>>>>>>>>> does not have the reason tag but aggregates among all
>>>>>>>>>>>>>>>>>>>>> possible reasons? This is for users to easily set their
>>>>>>>>>>>>>>>>>>>>> alerting notifications (otherwise they have to write one
>>>>>>>>>>>>>>>>>>>>> notification rule per reason) in their monitoring systems.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2. Note that the processor-node metrics are actually
>>>>>>>>>> "per-thread,
>>>>>>>>>>>>>>>>>>>> per-task,
>>>>>>>>>>>>>>>>>>>>> per-processor-node", and today we only set the
>>>>> per-thread
>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>> INFO
>>>>>>>>>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree
>>>>> with
>>>>>>>>> your
>>>>>>>>>>>>>>>>>> argument
>>>>>>>>>>>>>>>>>>>>> that we are missing the per-client roll-up metrics
>>>>> today,
>>>>>>> but
>>>>>>>>>>>>>> I'm not
>>>>>>>>>>>>>>>>>>>> convinced
>>>>>>>>>>>>>>>>>>>>> that the right way to approach it would be
>>>>>>>>>>>>>>>>>> "just-providing-the-lowest-
>>>>>>>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>>>>>>>> metrics only".
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Note the recording implementation of these three levels is
>>>>>>>>>>>>>>>>>>>>> different internally today: we did not just do the rolling
>>>>>>>>>>>>>>>>>>>>> up to generate the higher-level metrics from the
>>>>>>>>>>>>>>>>>>>>> lower-level ones; we record them separately, which means
>>>>>>>>>>>>>>>>>>>>> that, if we turn on multiple levels of metrics, we may
>>>>>>>>>>>>>>>>>>>>> duplicate collecting some metrics. One can argue
>>>>>> that
>>>>>>>>> is
>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> best
>>>>>>>>>>>>>>>>>>>>> way to represent multi-level metrics collecting and
>>>>>>>>> reporting,
>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
>>>>>>>>>> implementation
>>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> more efficient than only collecting the metrics at the
>>>>>>> lowest
>>>>>>>>>>>>>>>>> level,
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> then do the roll-up calculations outside of the
>>>> metrics
>>>>>>>>>> classes.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
>>>>>>>>> records,
>>>>>>>>>>>>>>>>> AFAIK
>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>> will only skip records at the source, sink, window and
>>>>>>>>>>>>>>>> aggregation
>>>>>>>>>>>>>>>>>>>>> processor nodes, so adding a metric per processor
>>>> looks
>>>>>> like
>>>>>>>>> an
>>>>>>>>>>>>>>>>>>> overkill
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> me as well. On the other hand, from the user's perspective
>>>>>>>>>>>>>>>>>>>>> the "reason" tag may be sufficient to narrow down where
>>>>>>>>>>>>>>>>>>>>> inside the topology records are being dropped on the
>>>>>>>>>>>>>>>>>>>>> floor. So I think the "per-thread, per-task" level metrics
>>>>>>>>>>>>>>>>>>>>> should be sufficient for troubleshooting in DEBUG mode,
>>>>>>>>>>>>>>>>>>>>> and we can add another "per-thread" level metric at INFO
>>>>>>>>>>>>>>>>>>>>> which is turned on by default. So under normal execution
>>>>>>>>>>>>>>>>>>>>> users still only need INFO level metrics for alerting
>>>>>>>>>>>>>>>>>>>>> (e.g. set alerts on all skipped-records metrics being
>>>>>>>>>>>>>>>>>>>>> non-zero), and then upon troubleshooting they can turn on
>>>>>>>>>>>>>>>>>>>>> DEBUG metrics to look into which task is actually causing
>>>>>>>>>>>>>>>>>>>>> the skipped records.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
>>>>>>>>>>>>>>>>>>> matth...@confluent.io>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am
>>>>> wondering
>>>>>>>>> what
>>>>>>>>>>>>>>>>>>> `reason`
>>>>>>>>>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The
>>>> KIP
>>>>>>>>> should
>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>>>>> IMHO.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you
>>>> say
>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
>>>>>>>>>>>>>>>>> simultaneously
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not
>>>>>> convinced
>>>>>>>>> atm
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>> should put the fine grained metrics into INFO level
>>>> and
>>>>>>>>> remove
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> roll-up at thread level.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
>>>>>>>>>>>>>>>> top-level
>>>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> This is a fair concern, but I don't share the
>>>>> conclusion.
>>>>>>>>>>>>>>>>> Offering
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the
>>>> box
>>>>>>>>> might
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> better solution. In the past we did not offer this
>>>> due
>>>>> to
>>>>>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism.
>>>> If
>>>>>> you
>>>>>>>>>>>>>>>>>> disagree,
>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>> you provide some reasoning and add them to the
>>>>> "Rejected
>>>>>>>>>>>>>>>>>>> alternatives"
>>>>>>>>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> To rephrase: I understand the issue about the missing
>>>>>>>>>>>>>>>>>>>>>> top-level view, but instead of going more fine-grained,
>>>>>>>>>>>>>>>>>>>>>> we should consider adding this top-level view and
>>>>>>>>>>>>>>>>>>>>>> add/keep the fine-grained metrics at DEBUG level only.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to
>>>>> remove
>>>>>>>>> old
>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>>>> directly as you suggested.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
>>>>>>>>>>>>>>>>> j...@confluent.io
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics
>>>> around
>>>>>>>>>>>>>>>> skipped
>>>>>>>>>>>>>>>>>>> records
>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> Streams.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -- Guozhang
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
> 
