Hey Guozhang, Thanks for the review.
1. Matthias raised the same question about the "reason" tag values. I can list all possible values of the "reason" tag, but I'm thinking this level of detail may not be KIP-worthy, maybe the code and documentation review would be sufficient. If you all disagree and would like it included in the KIP, I can certainly do that. If we do provide roll-up metrics, I agree with the pattern of keeping the same name but eliminating the tags for the dimensions that were rolled-up. 2. I'm not too sure that implementation efficiency really becomes a factor in choosing whether to (by default) update one coarse metric at the thread level or one granular metric at the processor-node level, since it's just one metric being updated either way. I do agree that if we were to update the granular metrics and multiple roll-ups, then we should consider the efficiency. I agree it's probably not necessary to surface the metrics for all nodes regardless of whether they can or do skip records. Perhaps we can lazily register the metrics. In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time. One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable? Thanks again, -John On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote: > Hello John, > > Thanks for the KIP. Some comments: > > 1. Could you list all the possible values of the "reason" tag? In the JIRA > ticket I left some potential reasons but I'm not clear if you're going to > categorize each of them as a separate reason, or is there any additional > ones you have in mind. > > Also I'm wondering if we should add another metric that do not have the > reason tag but aggregates among all possible reasons? This is for users to > easily set their alerting notifications (otherwise they have to write on > notification rule per reason) in their monitoring systems. > > 2. Note that the processor-node metrics is actually "per-thread, per-task, > per-processor-node", and today we only set the per-thread metrics as INFO > while leaving the lower two layers as DEBUG. I agree with your argument > that we are missing the per-client roll-up metrics today, but I'm convinced > that the right way to approach it would be "just-providing-the-lowest- > level > metrics only". > > Note the recoding implementation of these three levels are different > internally today: we did not just do the rolling up to generate the > higher-level metrics from the lower level ones, but we just record them > separately, which means that, if we turn on multiple levels of metrics, we > maybe duplicate collecting some metrics. One can argue that is not the best > way to represent multi-level metrics collecting and reporting, but by only > enabling thread-level metrics as INFO today, that implementation could be > more efficient than only collecting the metrics at the lowest level, and > then do the roll-up calculations outside of the metrics classes. > > Plus, today not all processor-nodes may possibly skip records, AFAIK we > will only skip records at the source, sink, window and aggregation > processor nodes, so adding a metric per processor looks like an overkill to > me as well. On the other hand, from user's perspective the "reason" tag may > be sufficient for them to narrow down where inside the topology is causing > records to be dropped on the floor. So I think the "per-thread, per-task" > level metrics should be sufficient for them in trouble shoot in DEBUG mode, > and we can add another "per-thread" level metrics as INFO which is turned > on by default. So under normal execution users still only need INFO level > metrics for alerting (e.g. set alerts on all skipped-records metrics as > non-zero), and then upon trouble shooting they can turn on DEBUG metrics to > look into which task is actually causing the skipped records. > > > Guozhang > > > > > On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> > wrote: > > > Thanks for the KIP John. > > > > Reading the material on the related Jiras, I am wondering what `reason` > > tags you want to introduce? Can you elaborate? The KIP should list those > > IMHO. > > > > About the fine grained metrics vs the roll-up: you say that > > > > > the coarse metric aggregates across two dimensions simultaneously > > > > Can you elaborate why this is an issue? I am not convinced atm that we > > should put the fine grained metrics into INFO level and remove the > > roll-up at thread level. > > > > > Given that they have to do this sum to get a usable top-level view > > > > This is a fair concern, but I don't share the conclusion. Offering a > > built-in `KafkaStreams` "client" roll-up out of the box might be a > > better solution. In the past we did not offer this due to performance > > concerns, but we could allow an "opt-in" mechanism. If you disagree, can > > you provide some reasoning and add them to the "Rejected alternatives" > > section. > > > > To rephrase: I understand the issue about missing top-level view, but > > instead of going more fine grained, we should consider to add this > > top-level view and add/keep the fine grained metrics at DEBUG level only > > > > I am +1 to add TopologyTestDriver#metrics() and to remove old metrics > > directly as you suggested. > > > > > > -Matthias > > > > > > > > On 3/28/18 6:42 PM, Ted Yu wrote: > > > Looks good to me. > > > > > > On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> > wrote: > > > > > >> Hello all, > > >> > > >> I am proposing KIP-274 to improve the metrics around skipped records > in > > >> Streams. > > >> > > >> Please find the details here: > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > >> 274%3A+Kafka+Streams+Skipped+Records+Metrics > > >> > > >> Please let me know what you think! > > >> > > >> Thanks, > > >> -John > > >> > > > > > > > > > > -- > -- Guozhang >