Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Bruno Cadonna Mon, 09 Aug 2021 02:19:38 -0700

Hi,

With "group", I actually meant the "type" tag. A thread-level metrics inStreams has the following tag "type=stream-thread-metrics" which we alsocall "group" in the code.I realized now that I mentioned "type" and "group" in my previous e-mailwhich are synonyms for the same concept.

For example, the blocked-time-total metric in Rohan's KIP should bedefined as:


blocked-time-total

tags:
type = stream-thread-metrics
thread-id = [stream thread ID]

level: INFO

description: the total time the Kafka Streams thread spent blocked on Kafka.

Best,
Bruno


On 27.07.21 10:13, Sophie Blee-Goldman wrote:

Thanks for the clarifications, that all makes sense.

I'm ready to vote on the KIP, but can you just update the KIP first to
address Bruno's feedback? Ie just fix the tags and fill in the missing
fields.

For example it sounds like these would be thread-level metrics. You should
be able to figure out what the values should be from the KIP-444 doc,
there's a chart with the type and tags for all thread-level metrics.

Not 100% sure what Bruno meant by "group" but my guess would be whether
it's INFO/DEBUG/TRACE. This is probably one of the most important
things to include in a KIP that's introducing new metrics: how big is the
potential performance impact of recording these metrics? How big is the
intended audience, would these be useful to almost everyone or are they
more "niche"?

It sounds like maybe DEBUG would be most appropriate here -- WDYT?

-Sophie

On Thu, Jul 22, 2021 at 9:01 AM Bruno Cadonna <cado...@apache.org> wrote:

Hi Rohan,

Thank you for the KIP!

I agree that the KIP is well-motivated.

What is not very clear is the metadata like type, group, and tags of the
metrics. For example, there is not application-id tag in Streams and
there is also no producer-id tag. The clients, i.e., producer, admin,
consumer, and also Streams have a client-id tag, that corresponds to the
producer-id, consumer-id, etc you use in the KIP.

For examples of metadata used in Streams you can look at the following
KIPs:

-

https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams
-

https://cwiki.apache.org/confluence/display/KAFKA/KIP-471%3A+Expose+RocksDB+Metrics+in+Kafka+Streams
-

https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB
-

https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams


Best,
Bruno

On 22.07.21 09:42, Rohan Desai wrote:

re sophie:

The intent here was to include all blocked time (not just `RUNNING`). The
caller can window the total blocked time themselves, and that can be
compared with a timeseries of the state to understand the ratio in
different states. I'll update the KIP to include `committed`. The admin

API

calls should be accounted for by the admin client iotime/iowaittime
metrics.

On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai <desai.p.ro...@gmail.com>
wrote:

I remember now that we moved the round-trip PID's txn completion logic

into
init-transaction and commit/abort-transaction. So I think we'd count

time

as in StreamsProducer#initTransaction as well (admittedly it is in most
cases a one-time thing).

Makes sense - I'll update the KIP

On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <desai.p.ro...@gmail.com>
wrote:

I had a question - it seems like from the descriptionsof

`txn-commit-time-total` and `offset-commit-time-total` that they

measure

similar processes for ALOS and EOS, but only `txn-commit-time-total` is
included in `blocked-time-total`. Why isn't `offset-commit-time-total`

also

included?

I've updated the KIP to include it.

Aside from `flush-time-total`, `txn-commit-time-total` and

`offset-commit-time-total`, which will be producer/consumer client
metrics,
the rest of the metrics will be streams metrics that will be thread

level,

is that right?

Based on the feedback from Guozhang, I've updated the KIP to reflect

that

the lower-level metrics are all client metrics that are then summed to
compute the blocked time metric, which is a Streams metric.

On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <desai.p.ro...@gmail.com>
wrote:

Similarly, I think "txn-commit-time-total" and

"offset-commit-time-total" may better be inside producer and consumer
clients respectively.

I agree for offset-commit-time-total. For txn-commit-time-total I'm
proposing we measure `StreamsProducer.commitTransaction`, which wraps
multiple producer calls (sendOffsets, commitTransaction)

For "txn-commit-time-total" specifically, besides

producer.commitTxn.
other txn-related calls may also be blocking, including
producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
later in the doc, but did not include it as a separate metric, and
similarly, should we have a `txn-abort-time-total` as well? If yes,
could
you update the KIP page accordingly.

`beginTransaction` is not blocking - I meant to remove that from that
doc. I'll add something for abort.

On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <desai.p.ro...@gmail.com

wrote:

Thanks for the review Guozhang! responding to your feedback inline:

1) I agree that the current ratio metrics is just "snapshot in

point", and
more flexible metrics that would allow reporters to calculate based

on

window intervals are better. However, the current mechanism of the
proposed
metrics assumes the thread->clients mapping as of today, where each
thread
would own exclusively one main consumer, restore consumer, producer

and

an
admin client. But this mapping may be subject to change in the

future.

Have
you thought about how this metric can be extended when, e.g. the
embedded
clients and stream threads are de-coupled?

Of course this depends on how exactly we refactor the runtime -
assuming that we plan to factor out consumers into an "I/O" layer

that is

responsible for receiving records and enqueuing them to be processed

by

processing threads, then I think it should be reasonable to count

the time

we spend blocked on this internal queue(s) as blocked. The main

concern

there to me is that the I/O layer would be doing something expensive

like

decompression that shouldn't be counted as "blocked". But if that

really is

so expensive that it starts to throw off our ratios then it's

probably

indicative of a larger problem that the "i/o layer" is a bottleneck

and it

would be worth refactoring so that decompression (or insert other

expensive

thing here) can also be done on the processing threads.

2) [This and all below are minor comments] The "flush-time-total"

may

better be a producer client metric, as "flush-wait-time-total", than

streams metric, though the streams-level "total-blocked" can still
leverage
it. Similarly, I think "txn-commit-time-total" and
"offset-commit-time-total" may better be inside producer and consumer
clients respectively.

Good call - I'll update the KIP

3) The doc was not very clear on how "thread-start-time" would be

needed
when calculating streams utilization along with total-blocked time,
could
you elaborate a bit more in the KIP?

Yes, will do.

For "txn-commit-time-total" specifically, besides

producer.commitTxn.

other txn-related calls may also be blocking, including
producer.beginTxn/abortTxn, I saw you mentioned

"txn-begin-time-total"

later in the doc, but did not include it as a separate metric, and
similarly, should we have a `txn-abort-time-total` as well? If yes,
could
you update the KIP page accordingly.

Ack.

On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <

desai.p.ro...@gmail.com>

wrote:

Hello All,

I'd like to start a discussion on the KIP linked above which

proposes

some metrics that we would find useful to help measure whether a

Kafka

Streams application is saturated. The motivation section in the KIP

goes

into some more detail on why we think this is a useful addition to

the

metrics already implemented. Thanks in advance for your feedback!

Best Regards,

Rohan

On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <

desai.p.ro...@gmail.com>

wrote:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Reply via email to