Hi Guozhang,
Thanks for the KIP! I think this KIP is a really nice addition to better
understand what is going on in a Kafka Streams application.
1.
The metric names "paused-active-tasks" and "paused-standby-tasks" might
be a bit confusing since at least active tasks can be paused also
outside of restoration.
2.
Why is the type of the metrics "stream-state-metrics"? I would have
expected "stream-thread-metrics" as the type.
3.
Isn't the value of the metric "restoring-standby-tasks" simply the
number of standby tasks since standby tasks are basically always
updating (aka restoring)?
4.
"idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
tailored to the upcoming state updater. They do not make much sense with
a stream thread. Would it be better to introduce a new metrics level
specifically for the state updater?
5.
Personally, I do not like to use the word "restoration" together with
standbys since restoration somehow implies that there is an offset for
which the active task is considered restored and active processing can
start. In other words, restoration is finite. Standby tasks rather
update continuously their states. They can be up-to-date or lagging. I
see that you could say "restored" instead of "up-to-date" and "not
restored" instead of "lagging", but IMO it does not describe well the
situation. That is a rather minor point. I just wanted to mention it.
6.
The name "onRestorePaused()" might be confusing since in Kafka Streams
users can also pause tasks. What about "onRestoreAborted()" or
"onRestoreSuspended"?
Best,
Bruno
On 16.09.22 19:33, Guozhang Wang wrote:
Hello everyone,
I'd like to start a discussion for the following KIP, aiming to improve
Kafka Stream's restoration visibility via new metrics and callback methods:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
Thanks!
-- Guozhang