Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

John Roesler Thu, 29 Mar 2018 15:26:51 -0700

Hi Matthias,

Thanks for the review.

Regarding the "reason" tags, I listed "deserialization-error" and
"negative-timestamp" to capture the two skips we already account for. More
generally, I would like to establish a pattern in which we could add new
values for the "reason" tags without needing a KIP to do so.

For example, pulling from Guozhang's list in the jira, if we discovered the
"null key in aggregation" skip after merging the work for this KIP, instead
of creating a new KIP to add a new value for the reason tag, we'd just
create a PR like "MINOR: add null key in aggregation to skipped-records
metric". In that PR, we would just record the new metric with the same
metric name, but something like "null-key-in-aggregation" as the value.
That's not to say we wouldn't document the full list of tag values, just
that we don't need the KIProcess to propose and ratify the tag name.

If you agree with this proposal, then by extension we also do not need to
agree on the full list of "reason"s during this KIP discussion and can
instead deal with that during code review.

Next:

> About the fine grained metrics vs the roll-up: you say that
>
> > the coarse metric aggregates across two dimensions simultaneously
>
> Can you elaborate why this is an issue? I am not convinced atm that we
> should put the fine grained metrics into INFO level and remove the
> roll-up at thread level.
>

Sure. The issue is the assumption that our library's users would prefer to
see a skip counter with no context by default and then optionally see the
information about the reason and the location in the topology of the skip.
I'm not sure I share this assumption. For example, if I run my Streams app
happily for a long time, and then see the skip counter go to 1, I can't be
sure that turning on DEBUG after the fact will be sufficient to help me
figure out what happened. I would have to try to re-wind my application to
re-process the same data with DEBUG on to investigate the issue, which
could be complex indeed. As a result, I would personally just run with
DEBUG on all the time.

I guess what I'm saying is that, to me, the reason why my stream processing
framework drops some data is important information that we should expose by
default. Given access to this data, I don't see why we would also provide a
partial roll-up of it at the same level of verbosity.

On whether we provide an optional top-level roll-up, the thought did cross
my mind that we could register a gauge that recursively aggregates all the
granular skip metrics. The gauge wouldn't introduce any contention, since
it wouldn't do anything until it's invoked. But we still couldn't aggregate
above the level of the process, so I figure it's better to avoid the
complexity and let users report and aggregate metrics on their own.

I'll update the KIP with my thoughts as you suggest.

Thanks again,
-John

On Thu, Mar 29, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io>
wrote:

> Thanks for the KIP John.
>
> Reading the material on the related Jiras, I am wondering what `reason`
> tags you want to introduce? Can you elaborate? The KIP should list those
> IMHO.
>
> About the fine grained metrics vs the roll-up: you say that
>
> > the coarse metric aggregates across two dimensions simultaneously
>
> Can you elaborate why this is an issue? I am not convinced atm that we
> should put the fine grained metrics into INFO level and remove the
> roll-up at thread level.
>
> > Given that they have to do this sum to get a usable top-level view
>
> This is a fair concern, but I don't share the conclusion. Offering a
> built-in `KafkaStreams` "client" roll-up out of the box might be a
> better solution. In the past we did not offer this due to performance
> concerns, but we could allow an "opt-in" mechanism. If you disagree, can
> you provide some reasoning and add them to the "Rejected alternatives"
> section.
>
> To rephrase: I understand the issue about missing top-level view, but
> instead of going more fine grained, we should consider to add this
> top-level view and add/keep the fine grained metrics at DEBUG level only
>
> I am +1 to add TopologyTestDriver#metrics() and to remove old metrics
> directly as you suggested.
>
>
> -Matthias
>
>
>
> On 3/28/18 6:42 PM, Ted Yu wrote:
> > Looks good to me.
> >
> > On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:
> >
> >> Hello all,
> >>
> >> I am proposing KIP-274 to improve the metrics around skipped records in
> >> Streams.
> >>
> >> Please find the details here:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> >>
> >> Please let me know what you think!
> >>
> >> Thanks,
> >> -John
> >>
> >
>
>

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Reply via email to