Hey Rohan,

The current metrics proposed in the KIP all LGTM, but if the goal is to
include *all* time spent
blocking on any client API then there are a few that might need to be added
to this list. For
example Consumer#committed is called in StreamTask to get the offsets
during initialization,
and several blocking AdminClient APIs are used by Streams such as
#listOffsets (as well as
 #createTopics and #describeTopics, but only during a rebalance to
create/validate the internal
topics)

One thing that wasn't quite clear to me was whether this metric is intended
to reflect only the
time spent blocking on Kafka *while the StreamThread is in RUNNING*, or
just the total time
spent inside any Kafka client API after the thread has started up. If we
only care about the
former, then the current proposal is sufficient, but if we're actually
targeting the latter then
we need to consider those additional APIs listed above.

Personally, I feel that both could actually be useful -- ie, we expose one
metric that only
corresponds to time we spend blocked on Kafka that could otherwise be spent
processing
records from active tasks, and a second metric that reports all time spent
blocking on Kafka,
whether the thread was still initializing (ie in the PARTITIONS_ASSIGNED
state) or actively
processing (ie in RUNNING). WDYT?


On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai <desai.p.ro...@gmail.com>
wrote:

> > I remember now that we moved the round-trip PID's txn completion logic
> into
> init-transaction and commit/abort-transaction. So I think we'd count time
> as in StreamsProducer#initTransaction as well (admittedly it is in most
> cases a one-time thing).
>
> Makes sense - I'll update the KIP
>
> On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <desai.p.ro...@gmail.com>
> wrote:
>
> >
> > > I had a question - it seems like from the descriptionsof
> > `txn-commit-time-total` and `offset-commit-time-total` that they measure
> > similar processes for ALOS and EOS, but only `txn-commit-time-total` is
> > included in `blocked-time-total`. Why isn't `offset-commit-time-total`
> also
> > included?
> >
> > I've updated the KIP to include it.
> >
> > > Aside from `flush-time-total`, `txn-commit-time-total` and
> > `offset-commit-time-total`, which will be producer/consumer client
> metrics,
> > the rest of the metrics will be streams metrics that will be thread
> level,
> > is that right?
> >
> > Based on the feedback from Guozhang, I've updated the KIP to reflect that
> > the lower-level metrics are all client metrics that are then summed to
> > compute the blocked time metric, which is a Streams metric.
> >
> > On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <desai.p.ro...@gmail.com>
> > wrote:
> >
> >> > Similarly, I think "txn-commit-time-total" and
> >> "offset-commit-time-total" may better be inside producer and consumer
> >> clients respectively.
> >>
> >> I agree for offset-commit-time-total. For txn-commit-time-total I'm
> >> proposing we measure `StreamsProducer.commitTransaction`, which wraps
> >> multiple producer calls (sendOffsets, commitTransaction)
> >>
> >> > > For "txn-commit-time-total" specifically, besides
> producer.commitTxn.
> >> other txn-related calls may also be blocking, including
> >> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> >> later in the doc, but did not include it as a separate metric, and
> >> similarly, should we have a `txn-abort-time-total` as well? If yes,
> could
> >> you update the KIP page accordingly.
> >>
> >> `beginTransaction` is not blocking - I meant to remove that from that
> >> doc. I'll add something for abort.
> >>
> >> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <desai.p.ro...@gmail.com>
> >> wrote:
> >>
> >>> Thanks for the review Guozhang! responding to your feedback inline:
> >>>
> >>> > 1) I agree that the current ratio metrics is just "snapshot in
> >>> point", and
> >>> more flexible metrics that would allow reporters to calculate based on
> >>> window intervals are better. However, the current mechanism of the
> >>> proposed
> >>> metrics assumes the thread->clients mapping as of today, where each
> >>> thread
> >>> would own exclusively one main consumer, restore consumer, producer and
> >>> an
> >>> admin client. But this mapping may be subject to change in the future.
> >>> Have
> >>> you thought about how this metric can be extended when, e.g. the
> embedded
> >>> clients and stream threads are de-coupled?
> >>>
> >>> Of course this depends on how exactly we refactor the runtime -
> assuming
> >>> that we plan to factor out consumers into an "I/O" layer that is
> >>> responsible for receiving records and enqueuing them to be processed by
> >>> processing threads, then I think it should be reasonable to count the
> time
> >>> we spend blocked on this internal queue(s) as blocked. The main concern
> >>> there to me is that the I/O layer would be doing something expensive
> like
> >>> decompression that shouldn't be counted as "blocked". But if that
> really is
> >>> so expensive that it starts to throw off our ratios then it's probably
> >>> indicative of a larger problem that the "i/o layer" is a bottleneck
> and it
> >>> would be worth refactoring so that decompression (or insert other
> expensive
> >>> thing here) can also be done on the processing threads.
> >>>
> >>> > 2) [This and all below are minor comments] The "flush-time-total" may
> >>> better be a producer client metric, as "flush-wait-time-total", than a
> >>> streams metric, though the streams-level "total-blocked" can still
> >>> leverage
> >>> it. Similarly, I think "txn-commit-time-total" and
> >>> "offset-commit-time-total" may better be inside producer and consumer
> >>> clients respectively.
> >>>
> >>> Good call - I'll update the KIP
> >>>
> >>> > 3) The doc was not very clear on how "thread-start-time" would be
> >>> needed
> >>> when calculating streams utilization along with total-blocked time,
> could
> >>> you elaborate a bit more in the KIP?
> >>>
> >>> Yes, will do.
> >>>
> >>> > For "txn-commit-time-total" specifically, besides producer.commitTxn.
> >>> other txn-related calls may also be blocking, including
> >>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> >>> later in the doc, but did not include it as a separate metric, and
> >>> similarly, should we have a `txn-abort-time-total` as well? If yes,
> >>> could
> >>> you update the KIP page accordingly.
> >>>
> >>> Ack.
> >>>
> >>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <desai.p.ro...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hello All,
> >>>>
> >>>> I'd like to start a discussion on the KIP linked above which proposes
> >>>> some metrics that we would find useful to help measure whether a Kafka
> >>>> Streams application is saturated. The motivation section in the KIP
> goes
> >>>> into some more detail on why we think this is a useful addition to the
> >>>> metrics already implemented. Thanks in advance for your feedback!
> >>>>
> >>>> Best Regards,
> >>>>
> >>>> Rohan
> >>>>
> >>>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <desai.p.ro...@gmail.com
> >
> >>>> wrote:
> >>>>
> >>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
> >>>>>
> >>>>
>

Reply via email to