Hi Hemanga,

1. I think a CREATED -> STABLE transition metric could be added. Please open a JIRA for it; you are also welcome to contribute the improvement :)
2. I don't know what the right solution is here. Maybe a counter for "slow" transitions: you configure a threshold for each transition, and the counter is incremented whenever a transition takes longer than its threshold. Then you can set up alerting on the counter.

Cheers
Gyula

On Mon, Sep 29, 2025 at 4:34 PM Hemanga Borah via user <[email protected]> wrote:

> Hello folks,
>
> TL;DR:
>
> 1) There are no metrics for end-to-end lifecycle time for application
> creation
>
> 2) Histogram metrics mask new points
>
> ----------------------------------------------------------------
>
> We are using Flink with the Flink Kubernetes Operator at Datadog. We have
> been using the Flink operator metrics for monitoring and have been
> wondering about two issues.
>
> Context:
>
> The metrics emitted by the operator are defined here:
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/
>
> We are interested in monitoring the Lifecycle metrics, specifically
> “Lifecycle.Transition time”. We want to monitor the transition times for
> new application creation and upgrades, and alert if they exceed a
> certain threshold.
>
> We are facing these issues:
>
> 1. There are no metrics for end-to-end lifecycle time for application
> creation:
>
> If we want to monitor the end-to-end time taken by the operator, we need
> to monitor the transition time for two operations: a) new application
> creation and b) updates to existing applications.
>
> The time for operation "b" (updates) is the metric TRANSITION_UPGRADE,
> which measures the time from STABLE back to STABLE. But there is no metric
> for operation "a" (new application creation), which would measure the
> transition from CREATED to STABLE.
>
> 2. The metrics are emitted as a histogram, so we cannot use them for
> alerts:
>
> Since the metrics are a histogram, we can see values for min, max, avg,
> median, P95 and count.
>
> Suppose there is a very high value for an Upgrade at time T0, say 30
> minutes. The value we see for this metric will continue to be 30 minutes
> until another Upgrade operation emits a new value at time T1, which could
> be much later. So, the metric stays at 30 minutes till T1. Additionally,
> if the new value at T1 is lower, say 15 minutes, and we are monitoring
> the max, it will be masked by the value from T0. Ideally, for the purposes
> of alerting, we would like to observe only one metric emitted for the
> event at time T0, which is 30 minutes, and then either zeros or no
> metrics, and at T1 we would like to see another metric with a value of 15
> minutes. Also, a new deployment clears out any histogram values because
> these are stored in the memory of the operator.
>
> For problem 2, we have explored these solutions:
>
> a) Change the histogram to hold a single item by setting
> metrics.histogram.sample.size to 1. While this reduces the context to
> just one data point, the metric still retains this value until another
> operation emits a metric. So, if we alert on this metric, the alert will
> not clear until there is another operation and a new value is emitted.
>
> b) We explored the idea of having a counter instead of a histogram. This
> will also not work as-is because the metrics are “scrape” based and not
> “emission” based, so the old value always stays. But this solution may
> work if the counter only ever increments and we alert on delta values.
>
> c) We created a log parser that reads the debug logs from the metrics and
> extracts the transition time logged here:
> https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/metrics/lifecycle/ResourceLifecycleMetricTracker.java#L78.
> This has worked nicely for us. However, we believe there must be a simpler
> solution to get this metric.
>
> Configuration:
>
> Flink version: 1.20.2
>
> Flink Kubernetes Operator version: 1.12.1
>
> Java version: 17
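
For illustration only, the "counter with a configurable threshold" idea from point 2 could be sketched roughly as below. This is plain Java with no operator dependencies. The class name SlowTransitionCounter, its methods, and the transition names and thresholds are hypothetical and are not part of the Flink Kubernetes Operator; in a real change the count would presumably be registered with the operator's metric group rather than kept in a LongAdder.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical sketch (not operator code): counts lifecycle transitions
 * that took longer than a per-transition threshold.
 */
class SlowTransitionCounter {
    // Threshold per transition name, e.g. "Upgrade" -> 10 minutes (assumed names).
    private final Map<String, Duration> thresholds;
    // Monotonically increasing count of slow transitions, per transition name.
    private final Map<String, LongAdder> slowCounts = new ConcurrentHashMap<>();

    SlowTransitionCounter(Map<String, Duration> thresholds) {
        this.thresholds = Map.copyOf(thresholds);
    }

    /**
     * Records one completed transition. Returns true if it exceeded the
     * configured threshold and was counted as slow. Transitions without a
     * configured threshold are never counted.
     */
    boolean record(String transition, Duration elapsed) {
        Duration threshold = thresholds.get(transition);
        if (threshold == null || elapsed.compareTo(threshold) <= 0) {
            return false;
        }
        slowCounts.computeIfAbsent(transition, k -> new LongAdder()).increment();
        return true;
    }

    /** Current number of slow transitions seen for the given transition name. */
    long slowCount(String transition) {
        LongAdder adder = slowCounts.get(transition);
        return adder == null ? 0L : adder.sum();
    }
}
```

Because the counter only ever increases, a scrape-based backend can alert on its increase over a time window instead of on the last reported value, which is the "alert on delta values" variant mentioned under solution b).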
