Hi Matthias,

Thanks for the review.

Regarding the "reason" tags: I listed "deserialization-error" and
"negative-timestamp" to capture the two skips we already account for.
More generally, I would like to establish a pattern in which we can add
new values for the "reason" tag without needing a KIP to do so. For
example, pulling from Guozhang's list in the Jira: if we discovered the
"null key in aggregation" skip after merging the work for this KIP,
then instead of creating a new KIP to add a new value for the reason
tag, we'd just create a PR like "MINOR: add null key in aggregation to
skipped-records metric". In that PR, we would record the new metric
under the same metric name, but with something like
"null-key-in-aggregation" as the tag value. That's not to say we
wouldn't document the full list of tag values, just that we don't need
the KIP process to propose and ratify each new value.

If you agree with this proposal, then by extension we also do not need
to agree on the full list of "reason"s during this KIP discussion; we
can deal with that during code review.
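To make that concrete, here is a rough sketch of the pattern using the
common metrics API directly. This is just an illustration, not the
exact implementation; the sensor, group, and tag names are
placeholders:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Total;

    public class SkippedRecordsSketch {
        public static void main(String[] args) {
            Metrics metrics = new Metrics();

            // The metric name never changes; only the "reason" tag
            // value does, so a new reason needs no KIP.
            Map<String, String> tags = new HashMap<>();
            tags.put("reason", "null-key-in-aggregation");

            Sensor skips = metrics.sensor("skips.null-key-in-aggregation");
            skips.add(
                metrics.metricName("skipped-records-total", "stream-metrics",
                                   "Total number of skipped records", tags),
                new Total());

            // At the skip site: one record() call per skipped record.
            skips.record();
        }
    }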
Regarding the "reason" tags, I listed "deserialization-error" and "negative-timestamp" to capture the two skips we already account for. More generally, I would like to establish a pattern in which we could add new values for the "reason" tags without needing a KIP to do so. For example, pulling from Guozhang's list in the jira, if we discovered the "null key in aggregation" skip after merging the work for this KIP, instead of creating a new KIP to add a new value for the reason tag, we'd just create a PR like "MINOR: add null key in aggregation to skipped-records metric". In that PR, we would just record the new metric with the same metric name, but something like "null-key-in-aggregation" as the value. That's not to say we wouldn't document the full list of tag values, just that we don't need the KIProcess to propose and ratify the tag name. If you agree with this proposal, then by extension we also do not need to agree on the full list of "reason"s during this KIP discussion and can instead deal with that during code review. Next: > About the fine grained metrics vs the roll-up: you say that > > > the coarse metric aggregates across two dimensions simultaneously > > Can you elaborate why this is an issue? I am not convinced atm that we > should put the fine grained metrics into INFO level and remove the > roll-up at thread level. > Sure. The issue is the assumption that our library's users would prefer to see a skip counter with no context by default and then optionally see the information about the reason and the location in the topology of the skip. I'm not sure I share this assumption. For example, if I run my Streams app happily for a long time, and then see the skip counter go to 1, I can't be sure that turning on DEBUG after the fact will be sufficient to help me figure out what happened. I would have to try to re-wind my application to re-process the same data with DEBUG on to investigate the issue, which could be complex indeed. As a result, I would personally just run with DEBUG on all the time. I guess what I'm saying is that, to me, the reason why my stream processing framework drops some data is important information that we should expose by default. Given access to this data, I don't see why we would also provide a partial roll-up of it at the same level of verbosity. On whether we provide an optional top-level roll-up, the thought did cross my mind that we could register a gauge that recursively aggregates all the granular skip metrics. The gauge wouldn't introduce any contention, since it wouldn't do anything until it's invoked. But we still couldn't aggregate above the level of the process, so I figure it's better to avoid the complexity and let users report and aggregate metrics on their own. I'll update the KIP with my thoughts as you suggest. Thanks again, -John On Thu, Mar 29, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io> wrote: > Thanks for the KIP John. > > Reading the material on the related Jiras, I am wondering what `reason` > tags you want to introduce? Can you elaborate? The KIP should list those > IMHO. > > About the fine grained metrics vs the roll-up: you say that > > > the coarse metric aggregates across two dimensions simultaneously > > Can you elaborate why this is an issue? I am not convinced atm that we > should put the fine grained metrics into INFO level and remove the > roll-up at thread level. > > > Given that they have to do this sum to get a usable top-level view > > This is a fair concern, but I don't share the conclusion. 
I'll update the KIP with my thoughts, as you suggest.

Thanks again,
-John

On Thu, Mar 29, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io>
wrote:
> Thanks for the KIP John.
>
> Reading the material on the related Jiras, I am wondering what
> `reason` tags you want to introduce? Can you elaborate? The KIP should
> list those IMHO.
>
> About the fine grained metrics vs the roll-up: you say that
>
> > the coarse metric aggregates across two dimensions simultaneously
>
> Can you elaborate why this is an issue? I am not convinced atm that we
> should put the fine grained metrics into INFO level and remove the
> roll-up at thread level.
>
> > Given that they have to do this sum to get a usable top-level view
>
> This is a fair concern, but I don't share the conclusion. Offering a
> built-in `KafkaStreams` "client" roll-up out of the box might be a
> better solution. In the past we did not offer this due to performance
> concerns, but we could allow an "opt-in" mechanism. If you disagree,
> can you provide some reasoning and add it to the "Rejected
> alternatives" section?
>
> To rephrase: I understand the issue about the missing top-level view,
> but instead of going more fine grained, we should consider adding this
> top-level view and keeping the fine grained metrics at DEBUG level
> only.
>
> I am +1 to add TopologyTestDriver#metrics() and to remove the old
> metrics directly, as you suggested.
>
> -Matthias
>
> On 3/28/18 6:42 PM, Ted Yu wrote:
> > Looks good to me.
> >
> > On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io>
> > wrote:
> >
> >> Hello all,
> >>
> >> I am proposing KIP-274 to improve the metrics around skipped
> >> records in Streams.
> >>
> >> Please find the details here:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
> >>
> >> Please let me know what you think!
> >>
> >> Thanks,
> >> -John