re sophie: The intent here was to include all blocked time (not just `RUNNING`). The caller can window the total blocked time themselves, and that can be compared with a timeseries of the state to understand the ratio in different states. I'll update the KIP to include `committed`. The admin API calls should be accounted for by the admin client iotime/iowaittime metrics.
On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai <desai.p.ro...@gmail.com> wrote: > > I remember now that we moved the round-trip PID's txn completion logic > into > init-transaction and commit/abort-transaction. So I think we'd count time > as in StreamsProducer#initTransaction as well (admittedly it is in most > cases a one-time thing). > > Makes sense - I'll update the KIP > > On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <desai.p.ro...@gmail.com> > wrote: > >> >> > I had a question - it seems like from the descriptionsof >> `txn-commit-time-total` and `offset-commit-time-total` that they measure >> similar processes for ALOS and EOS, but only `txn-commit-time-total` is >> included in `blocked-time-total`. Why isn't `offset-commit-time-total` also >> included? >> >> I've updated the KIP to include it. >> >> > Aside from `flush-time-total`, `txn-commit-time-total` and >> `offset-commit-time-total`, which will be producer/consumer client >> metrics, >> the rest of the metrics will be streams metrics that will be thread level, >> is that right? >> >> Based on the feedback from Guozhang, I've updated the KIP to reflect that >> the lower-level metrics are all client metrics that are then summed to >> compute the blocked time metric, which is a Streams metric. >> >> On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <desai.p.ro...@gmail.com> >> wrote: >> >>> > Similarly, I think "txn-commit-time-total" and >>> "offset-commit-time-total" may better be inside producer and consumer >>> clients respectively. >>> >>> I agree for offset-commit-time-total. For txn-commit-time-total I'm >>> proposing we measure `StreamsProducer.commitTransaction`, which wraps >>> multiple producer calls (sendOffsets, commitTransaction) >>> >>> > > For "txn-commit-time-total" specifically, besides >>> producer.commitTxn. >>> other txn-related calls may also be blocking, including >>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" >>> later in the doc, but did not include it as a separate metric, and >>> similarly, should we have a `txn-abort-time-total` as well? If yes, >>> could >>> you update the KIP page accordingly. >>> >>> `beginTransaction` is not blocking - I meant to remove that from that >>> doc. I'll add something for abort. >>> >>> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <desai.p.ro...@gmail.com> >>> wrote: >>> >>>> Thanks for the review Guozhang! responding to your feedback inline: >>>> >>>> > 1) I agree that the current ratio metrics is just "snapshot in >>>> point", and >>>> more flexible metrics that would allow reporters to calculate based on >>>> window intervals are better. However, the current mechanism of the >>>> proposed >>>> metrics assumes the thread->clients mapping as of today, where each >>>> thread >>>> would own exclusively one main consumer, restore consumer, producer and >>>> an >>>> admin client. But this mapping may be subject to change in the future. >>>> Have >>>> you thought about how this metric can be extended when, e.g. the >>>> embedded >>>> clients and stream threads are de-coupled? >>>> >>>> Of course this depends on how exactly we refactor the runtime - >>>> assuming that we plan to factor out consumers into an "I/O" layer that is >>>> responsible for receiving records and enqueuing them to be processed by >>>> processing threads, then I think it should be reasonable to count the time >>>> we spend blocked on this internal queue(s) as blocked. The main concern >>>> there to me is that the I/O layer would be doing something expensive like >>>> decompression that shouldn't be counted as "blocked". But if that really is >>>> so expensive that it starts to throw off our ratios then it's probably >>>> indicative of a larger problem that the "i/o layer" is a bottleneck and it >>>> would be worth refactoring so that decompression (or insert other expensive >>>> thing here) can also be done on the processing threads. >>>> >>>> > 2) [This and all below are minor comments] The "flush-time-total" may >>>> better be a producer client metric, as "flush-wait-time-total", than a >>>> streams metric, though the streams-level "total-blocked" can still >>>> leverage >>>> it. Similarly, I think "txn-commit-time-total" and >>>> "offset-commit-time-total" may better be inside producer and consumer >>>> clients respectively. >>>> >>>> Good call - I'll update the KIP >>>> >>>> > 3) The doc was not very clear on how "thread-start-time" would be >>>> needed >>>> when calculating streams utilization along with total-blocked time, >>>> could >>>> you elaborate a bit more in the KIP? >>>> >>>> Yes, will do. >>>> >>>> > For "txn-commit-time-total" specifically, besides producer.commitTxn. >>>> other txn-related calls may also be blocking, including >>>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" >>>> later in the doc, but did not include it as a separate metric, and >>>> similarly, should we have a `txn-abort-time-total` as well? If yes, >>>> could >>>> you update the KIP page accordingly. >>>> >>>> Ack. >>>> >>>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <desai.p.ro...@gmail.com> >>>> wrote: >>>> >>>>> Hello All, >>>>> >>>>> I'd like to start a discussion on the KIP linked above which proposes >>>>> some metrics that we would find useful to help measure whether a Kafka >>>>> Streams application is saturated. The motivation section in the KIP goes >>>>> into some more detail on why we think this is a useful addition to the >>>>> metrics already implemented. Thanks in advance for your feedback! >>>>> >>>>> Best Regards, >>>>> >>>>> Rohan >>>>> >>>>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <desai.p.ro...@gmail.com> >>>>> wrote: >>>>> >>>>>> >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams >>>>>> >>>>>