Hi Rohan, Thank you for the KIP!
I agree that the KIP is well-motivated.What is not very clear is the metadata like type, group, and tags of the metrics. For example, there is not application-id tag in Streams and there is also no producer-id tag. The clients, i.e., producer, admin, consumer, and also Streams have a client-id tag, that corresponds to the producer-id, consumer-id, etc you use in the KIP.
For examples of metadata used in Streams you can look at the following KIPs:- https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams - https://cwiki.apache.org/confluence/display/KAFKA/KIP-471%3A+Expose+RocksDB+Metrics+in+Kafka+Streams - https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB - https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
Best, Bruno On 22.07.21 09:42, Rohan Desai wrote:
re sophie: The intent here was to include all blocked time (not just `RUNNING`). The caller can window the total blocked time themselves, and that can be compared with a timeseries of the state to understand the ratio in different states. I'll update the KIP to include `committed`. The admin API calls should be accounted for by the admin client iotime/iowaittime metrics. On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai <desai.p.ro...@gmail.com> wrote:I remember now that we moved the round-trip PID's txn completion logicinto init-transaction and commit/abort-transaction. So I think we'd count time as in StreamsProducer#initTransaction as well (admittedly it is in most cases a one-time thing). Makes sense - I'll update the KIP On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <desai.p.ro...@gmail.com> wrote:I had a question - it seems like from the descriptionsof`txn-commit-time-total` and `offset-commit-time-total` that they measure similar processes for ALOS and EOS, but only `txn-commit-time-total` is included in `blocked-time-total`. Why isn't `offset-commit-time-total` also included? I've updated the KIP to include it.Aside from `flush-time-total`, `txn-commit-time-total` and`offset-commit-time-total`, which will be producer/consumer client metrics, the rest of the metrics will be streams metrics that will be thread level, is that right? Based on the feedback from Guozhang, I've updated the KIP to reflect that the lower-level metrics are all client metrics that are then summed to compute the blocked time metric, which is a Streams metric. On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <desai.p.ro...@gmail.com> wrote:Similarly, I think "txn-commit-time-total" and"offset-commit-time-total" may better be inside producer and consumer clients respectively. I agree for offset-commit-time-total. For txn-commit-time-total I'm proposing we measure `StreamsProducer.commitTransaction`, which wraps multiple producer calls (sendOffsets, commitTransaction)For "txn-commit-time-total" specifically, besidesproducer.commitTxn. other txn-related calls may also be blocking, including producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" later in the doc, but did not include it as a separate metric, and similarly, should we have a `txn-abort-time-total` as well? If yes, could you update the KIP page accordingly. `beginTransaction` is not blocking - I meant to remove that from that doc. I'll add something for abort. On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <desai.p.ro...@gmail.com> wrote:Thanks for the review Guozhang! responding to your feedback inline:1) I agree that the current ratio metrics is just "snapshot inpoint", and more flexible metrics that would allow reporters to calculate based on window intervals are better. However, the current mechanism of the proposed metrics assumes the thread->clients mapping as of today, where each thread would own exclusively one main consumer, restore consumer, producer and an admin client. But this mapping may be subject to change in the future. Have you thought about how this metric can be extended when, e.g. the embedded clients and stream threads are de-coupled? Of course this depends on how exactly we refactor the runtime - assuming that we plan to factor out consumers into an "I/O" layer that is responsible for receiving records and enqueuing them to be processed by processing threads, then I think it should be reasonable to count the time we spend blocked on this internal queue(s) as blocked. The main concern there to me is that the I/O layer would be doing something expensive like decompression that shouldn't be counted as "blocked". But if that really is so expensive that it starts to throw off our ratios then it's probably indicative of a larger problem that the "i/o layer" is a bottleneck and it would be worth refactoring so that decompression (or insert other expensive thing here) can also be done on the processing threads.2) [This and all below are minor comments] The "flush-time-total" maybetter be a producer client metric, as "flush-wait-time-total", than a streams metric, though the streams-level "total-blocked" can still leverage it. Similarly, I think "txn-commit-time-total" and "offset-commit-time-total" may better be inside producer and consumer clients respectively. Good call - I'll update the KIP3) The doc was not very clear on how "thread-start-time" would beneeded when calculating streams utilization along with total-blocked time, could you elaborate a bit more in the KIP? Yes, will do.For "txn-commit-time-total" specifically, besides producer.commitTxn.other txn-related calls may also be blocking, including producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" later in the doc, but did not include it as a separate metric, and similarly, should we have a `txn-abort-time-total` as well? If yes, could you update the KIP page accordingly. Ack. On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <desai.p.ro...@gmail.com> wrote:Hello All, I'd like to start a discussion on the KIP linked above which proposes some metrics that we would find useful to help measure whether a Kafka Streams application is saturated. The motivation section in the KIP goes into some more detail on why we think this is a useful addition to the metrics already implemented. Thanks in advance for your feedback! Best Regards, Rohan On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <desai.p.ro...@gmail.com> wrote:https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams