Hey, I'm changing my vote to binding now :) On Mon, Jan 23, 2023 at 9:38 PM Matthias J. Sax <mj...@apache.org> wrote:
> Thanks Guozhang. Couple of clarifications and follow up questions. > > > >> I'm not aware of a discussion to rename the call name to "suspend" for > >> KIP-834. Could you point me to the reference? > > My commend was not about KIP-834, but about this KIP. You originally > proposed to call the new call-back `onRestorePause` but to avoid > confusion it was improved and renamed to `onRestoreSuspended`. > > > > The only one so far that I feel is probably better, is > > "state-update-ratio". If folks feel this one is better than > > "restore-ratio" I'm happy to update. > > Could we actually report two metric, one for the restore phase > (restore-ration), and one for steady state ([standby-]update-ratio)? > > I could like with `state-update-ratio` if we want to have a single > metric for both, but splitting them sound useful to me. > > > > (4) `restore-call-rate` > > Maybe we can clarify in the description a little bit. I agree it's very > low level but if you think it could be useful to debugging, I have no > objection. > > > > The rationale behind it is the general principle in metrics design > > that "Kafka would provide the lowest necessary metrics levels, and > > users can do the roll-ups however they want". > > That's fair, but it seems to be a rather important metric, and having it > only at DEBUG level seems not ideal? Could we make it INFO level, even > if it's a task level (ie, apply an exception to the rule). > > > > -Matthias > > > > On 1/19/23 2:35 PM, Guozhang Wang wrote: > > Hello Matthias, > > > > Thanks for the feedback. I was on vacation for a while. Pardon for the > > late replies. Please see them inline below > > > > On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> > wrote: > >> > >> Seems I am late to the party... Great KIP. Couple of questions from my > side: > >> > >> (1) What is the purpose of `standby-updating-tasks`? It seems to be the > >> same as the number of assigned standby task? Not sure how useful it > >> would be? > >> > > In general, yes, it is the number of assigned standby tasks --- there > > will be transit times when the assigned standby tasks are not yet > > being updated but it would not last long --- but we do not yet have a > > direct gauge to expose this before, and users have to infer this from > > other indirect metrics. > > > >> > >> > >> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused" > >> exactly mean? There was a discussion about renaming the callback method > >> from pause to suspended. So should this be called `suspended`, too? And > >> if yes, how is it useful for users? > >> > > Pausing here refers to "KIP-834: Pause / Resume KafkaStreams > > Topologies" ( > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832 > ). > > When a topology is paused, all its tasks including standbys will be > > paused too. > > > > I'm not aware of a discussion to rename the call name to "suspend" for > > KIP-834. Could you point me to the reference? > > > >> > >> > >> (3) `restore-ratio`: the description says > >> > >>> The fraction of time the thread spent on restoring active or standby > tasks > >> > >> I find the term "restoring" does only apply to active tasks, but not to > >> standbys. Can we reword this? > >> > > Yeah I have been discussing this with others in the community a bit as > > well, but so far I have not been convinced of a better name than it. > > Some other alternatives being discussed but not win everyone's love is > > "restore-or-update-ratio", "process-ratio" (for the restore thread > > that means restoring or updating), and "io-ratio". > > > > The only one so far that I feel is probably better, is > > "state-update-ratio". If folks feel this one is better than > > "restore-ratio" I'm happy to update. > > > >> > >> (4) `restore-call-rate`: not sure what you exactly mean by "restore > calls"? > >> > > This is similar to the "io-calls-rate" in the selector classes, i.e. > > the number of "restore" function calls made. It's argurably a very > > low-level metrics but I included it since it could be useful in some > > debugging scenarios. > > > >> > >> (5) `restore-remaining-records-total` -- why is this a task metric? > >> Seems we could roll it up into a thread metric that we report at INFO > >> level (we could still have per-task DEBUG level metric for it in > addition). > >> > > The rationale behind it is the general principle in metrics design > > that "Kafka would provide the lowest necessary metrics levels, and > > users can do the roll-ups however they want". > > > >> > >> (6) What about "warmup tasks"? Internally, we treat them as standbys, > >> but it seems it's hard for users to reason about it in the scale-out > >> warm-up case. Would it be helpful (and possible) to report "warmup > >> progress" explicitly? > >> > > At the restore thread level, we cannot differentiate standby tasks > > from warmup tasks since the latter is created exactly just like the > > former. But I do agree this is an issue for visibility that worth > > addressing, I think another KIP would be needed to first consider > > distinguishing these two at the class level. > > > >> > >> -Matthias > >> > >> > >> On 11/1/22 2:44 AM, Lucas Brutschy wrote: > >>> We need this! > >>> > >>> + 1 non binding > >>> > >>> Cheers, > >>> Lucas > >>> > >>> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <cado...@apache.org> > wrote: > >>>> > >>>> Guozhang, > >>>> > >>>> Thanks for the KIP! > >>>> > >>>> +1 (binding) > >>>> > >>>> Best, > >>>> Bruno > >>>> > >>>> On 25.10.22 22:07, Walker Carlson wrote: > >>>>> +1 non binding > >>>>> > >>>>> Thanks for the kip! > >>>>> > >>>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vvcep...@apache.org> > wrote: > >>>>> > >>>>>> Thanks for the KIP, Guozhang! > >>>>>> > >>>>>> I'm +1 (binding) > >>>>>> > >>>>>> -John > >>>>>> > >>>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote: > >>>>>>> Can't wait! > >>>>>>> +1 (non-binding) > >>>>>>> > >>>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, < > guozhang.wang...@gmail.com> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Hello all, > >>>>>>>> > >>>>>>>> I'd like to start a vote for the following KIP, aiming to improve > Kafka > >>>>>>>> Stream's restoration visibility via new metrics and callback > methods: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility > >>>>>>>> > >>>>>>>> > >>>>>>>> Thanks! > >>>>>>>> -- Guozhang > >>>>>>>> > >>>>>> > >>>>> >