Hi Gyula,

Thanks for the suggestions!
Sounds good regarding opening a JIRA for the CREATED -> STABLE metric. I'll make the changes and contribute them to the open source project.

For the second point, I will likely add a counter metric that increments on every operation, so that we can alert on delta values (a rough sketch of what I have in mind is at the bottom of this mail, below the quoted thread). This will also require a code change in the OSS, so I will explore and test it before making any changes.

Thanks,
Hemanga

On Wed, Oct 1, 2025 at 9:05 AM Gyula Fóra <[email protected]> wrote:

> Hi Hemanga,
>
> 1. I think CREATED -> STABLE could be added, please open a JIRA for it, and you are also welcome to contribute the improvement :)
>
> 2. I don't know what the right solution here is, maybe a counter with a configurable threshold for "slow" upgrades: you would set a configurable threshold for different transitions, and the metric would be incremented whenever a transition is slower than that. Then you can set up alerting on the counter.
>
> Cheers
> Gyula
>
> On Mon, Sep 29, 2025 at 4:34 PM Hemanga Borah via user <[email protected]> wrote:
>
>> Hello folks,
>>
>> TL;DR:
>>
>> 1) There is no metric for the end-to-end lifecycle time of application creation.
>>
>> 2) Histogram metrics mask new data points.
>>
>> ----------------------------------------------------------------
>>
>> We are using Flink with the Apache Flink Kubernetes Operator at Datadog. We have been using the operator metrics for monitoring and have been wondering about two issues.
>>
>> Context:
>>
>> The metrics emitted by the operator are defined here: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/
>>
>> We are interested in monitoring the lifecycle metrics, specifically "Lifecycle.Transition time". We want to monitor the transition times for new application creation and for upgrades, and alert on them if they exceed a certain threshold.
>>
>> We are facing these issues:
>>
>> 1. There is no metric for the end-to-end lifecycle time of application creation:
>>
>> To monitor the end-to-end time taken by the operator, we need the transition time for two operations: a) new application creation and b) updates to existing applications.
>>
>> The time for operation "b" (updates) is covered by the TRANSITION_UPGRADE metric, which measures the time from STABLE back to STABLE. But there is no metric for operation "a" (new application creation), which would measure the transition from CREATED to STABLE.
>>
>> 2. The metrics are emitted as a histogram, so we cannot use them for alerts:
>>
>> Since the metrics are a histogram, we can see values for min, max, avg, median, P95 and count.
>>
>> Suppose there is a very high value for an upgrade at time T0, say 30 minutes. The value we see for this metric will remain 30 minutes until another upgrade operation emits a new value at time T1, which could be much later; so the metric stays at 30 minutes until T1. Additionally, if the new value at T1 is lower, say 15 minutes, and we are monitoring the max, it will be masked by the value from T0. Ideally, for alerting purposes, we would like to observe a single data point of 30 minutes for the event at T0, then either zeros or no metric at all, and then a single data point of 15 minutes at T1. Also, a new deployment of the operator clears out any histogram values, because they are stored in the operator's memory.
>>
>> For problem 2, we have explored these solutions:
>>
>> a) Reduce the histogram to a single item by setting metrics.histogram.sample.size to 1. While this narrows the window to just one data point, the metric still retains that value until another operation emits a new one. So, if we alert on this metric, the alert will not clear until another operation runs and a new value is emitted.
>>
>> b) We explored the idea of having a counter instead of a histogram. This also does not work as-is, because the metrics are "scrape" based and not "emission" based, so the value of the old metric always stays. However, this solution may work if we always increment and alert on delta values.
>>
>> c) We created a log parser that extracts the transition time from the debug logs emitted here: https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/metrics/lifecycle/ResourceLifecycleMetricTracker.java#L78. This has worked nicely for us, but we believe there must be a simpler way to get this metric.
>>
>> Configuration:
>>
>> Flink version: 1.20.2
>>
>> Flink Kubernetes Operator version: 1.12.1
>>
>> Java version: 17
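
P.S. To make the counter idea a bit more concrete, here is a rough, untested sketch that combines it with Gyula's threshold suggestion. It only uses the standard Flink metrics API (MetricGroup / Counter); the class name, the wiring and the threshold handling are placeholders I made up for illustration, not existing operator code. The real change would presumably hook into ResourceLifecycleMetricTracker, where the transition durations are already computed.

import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.MetricGroup;

/**
 * Sketch only: per-transition counters for lifecycle transitions.
 * "Count" is incremented on every completed transition so alerting can
 * watch per-interval deltas; "SlowCount" is incremented only when the
 * transition exceeded a configurable threshold (Gyula's suggestion).
 */
public class TransitionCounters {

    private final MetricGroup group;
    private final Duration slowThreshold; // would come from operator config in a real change
    private final Map<String, Counter> counts = new ConcurrentHashMap<>();
    private final Map<String, Counter> slowCounts = new ConcurrentHashMap<>();

    public TransitionCounters(MetricGroup lifecycleGroup, Duration slowThreshold) {
        this.group = lifecycleGroup;
        this.slowThreshold = slowThreshold;
    }

    /** Called once per completed transition, e.g. "Upgrade" or "Create". */
    public void onTransition(String transition, Duration elapsed) {
        counterFor(counts, transition, "Count").inc();
        if (elapsed.compareTo(slowThreshold) > 0) {
            counterFor(slowCounts, transition, "SlowCount").inc();
        }
    }

    private Counter counterFor(Map<String, Counter> cache, String transition, String suffix) {
        // Register lazily the first time a transition name is seen.
        return cache.computeIfAbsent(transition, t -> group.addGroup(t).counter(suffix));
    }
}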

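The wiring would then be something along these lines (operatorMetricGroup below is just a stand-in for whichever metric group the lifecycle tracker already has access to, and the 10-minute threshold is arbitrary):

    TransitionCounters counters =
            new TransitionCounters(operatorMetricGroup.addGroup("Lifecycle"), Duration.ofMinutes(10));
    counters.onTransition("Upgrade", Duration.ofMinutes(30)); // increments both Count and SlowCount

Since a counter only ever goes up while the operator is running, on the Datadog side we would alert on the per-interval delta rather than on the raw value, which avoids the masking problem described above.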