Hi Nick, Thanks for the reviews, and I think these are good suggestions. Note that currently the `restore-records-total/rate` would include both restoring active tasks as well as updating standby tasks, I think for your purposes you'd be more interested in active restoring tasks since they could complete, while updating standby tasks would not complete even if they have caught up with the active. At the same time, the changelog reader would only be restoring either active or standby at a given time, and active tasks has a higher priority such that as long as there is at least one active task still restoring, we would not restore any standby tasks. From your suggestion, I'm thinking that maybe I should break up the rate / total metric for active and standby tasks separately.
For deriving estimated time remaining though, the `total` metric may not be helpful since they will not be "reset" after rebalances, i.e. they will be an ever-increasing number and record the total number of records for the lifetime of the app. But still, just the remaining records alone, with the time elapsed monitored by the apps, should be sufficient to get the estimated time remaining. Guozhang On Tue, Sep 20, 2022 at 3:10 AM Nick Telford <nick.telf...@gmail.com> wrote: > Hi Guozhang, > > KIP looks great, I have one suggestion: in addition to > "restore-records-total", it would also be useful to track the number of > records *remaining*, that is, the records that have not yet been restored. > This is actually the metric I was attempting to implement in the > StateRestoreListener that bumped me up against KAFKA-10575 :-) > > With both a "restore-records-total" and a "restore-remaining-total" (or > similar) metric, it's possible to derive useful information like the > estimated time remaining for restoration (by dividing the remaining total > by the restoration rate). > > Regards, > > Nick > > On Mon, 19 Sept 2022 at 19:57, Guozhang Wang <wangg...@gmail.com> wrote: > > > Hello Bruno, > > > > Thanks for your comments! > > > > 1. Regarding the metrics group name: originally I put > > "stream-state-metrics" as it's related to state store restorations, but > > after a second thought I think I agree with you that this is quite > > confusing and not right. About the metrics groups, right now I have two > > ideas: > > > > a) Still use the metric group name "stream-thread-metrics", but > > differentiate with the processing threads on the thread id. The pros is > > that we do not introduce a new group, the cons is that users who want to > > separate processing from restoration/updating in the future needs to do > > that on the thread id labels. > > b) Introduce a new group name, for example "stream-state-updater-metrics" > > and still have a thread-id label. We would be introducing a new group > which > > still have a thread-id, which may sound a bit weird (maybe if we do that > we > > could change the existing "stream-thread-metrics" into > > "stream-processor-metrics"). > > > > Right now I'm leaning towards b) and maybe in the future rename > > "thread-metrics" to "processor-metrics", LMK what do you think. > > > > 2. Regarding the metric names: today we may also pause a standby tasks, > and > > hence I'm trying to differentiate updating standbys from paused standbys. > > Right now I'm thinking we can change "restoring-standby-tasks" to > > "updating-standby-tasks", and change all other metrics' "restore" (except > > the "restoring-active-tasks") to "state-update", a.k.a > > "state-update-ratio", "state-update-records-total", > > "updating-standby-tasks" etc. > > > > 3. Regarding the function name: yeah I think that's a valid concern, I > can > > change it to "onRestoreSuspended" since "Aborted" may confuse people that > > previously called "onBatchRestored" are undone as part of the abortion. > > > > > > Guozhang > > > > > > > > On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <cado...@apache.org> > wrote: > > > > > Hi Guozhang, > > > > > > Thanks for the KIP! I think this KIP is a really nice addition to > better > > > understand what is going on in a Kafka Streams application. > > > > > > 1. > > > The metric names "paused-active-tasks" and "paused-standby-tasks" might > > > be a bit confusing since at least active tasks can be paused also > > > outside of restoration. > > > > > > 2. > > > Why is the type of the metrics "stream-state-metrics"? I would have > > > expected "stream-thread-metrics" as the type. > > > > > > 3. > > > Isn't the value of the metric "restoring-standby-tasks" simply the > > > number of standby tasks since standby tasks are basically always > > > updating (aka restoring)? > > > > > > 4. > > > "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics > > > tailored to the upcoming state updater. They do not make much sense > with > > > a stream thread. Would it be better to introduce a new metrics level > > > specifically for the state updater? > > > > > > 5. > > > Personally, I do not like to use the word "restoration" together with > > > standbys since restoration somehow implies that there is an offset for > > > which the active task is considered restored and active processing can > > > start. In other words, restoration is finite. Standby tasks rather > > > update continuously their states. They can be up-to-date or lagging. I > > > see that you could say "restored" instead of "up-to-date" and "not > > > restored" instead of "lagging", but IMO it does not describe well the > > > situation. That is a rather minor point. I just wanted to mention it. > > > > > > 6. > > > The name "onRestorePaused()" might be confusing since in Kafka Streams > > > users can also pause tasks. What about "onRestoreAborted()" or > > > "onRestoreSuspended"? > > > > > > Best, > > > Bruno > > > > > > > > > On 16.09.22 19:33, Guozhang Wang wrote: > > > > Hello everyone, > > > > > > > > I'd like to start a discussion for the following KIP, aiming to > improve > > > > Kafka Stream's restoration visibility via new metrics and callback > > > methods: > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility > > > > > > > > > > > > Thanks! > > > > -- Guozhang > > > > > > > > > > > > > -- > > -- Guozhang > > > -- -- Guozhang