Hi Hemanga,

1. I think CREATED -> STABLE could be added. Please open a JIRA for it; you
are also welcome to contribute the improvement :)

2. I don't know what the right solution is here. One option is a counter for
"slow" upgrades: you set a configurable threshold for each transition, and
the counter is incremented whenever a transition takes longer than its
threshold. You can then set up alerting on the counter.
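
A minimal sketch of the threshold-counter idea above, in plain Java with no
operator or Flink dependencies. The class name, method names, and the
"Upgrade" transition key are hypothetical and only illustrate the logic;
nothing here is an existing operator API.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical helper (not part of the operator): counts lifecycle
 * transitions that took longer than a configurable per-transition threshold.
 */
final class SlowTransitionCounter {
    // Threshold per transition name, e.g. "Upgrade" -> 10 minutes (assumed config).
    private final Map<String, Duration> thresholds;
    // Monotonically increasing count of slow transitions per transition name.
    private final Map<String, LongAdder> slowCounts = new ConcurrentHashMap<>();

    SlowTransitionCounter(Map<String, Duration> thresholds) {
        this.thresholds = Map.copyOf(thresholds);
    }

    /** Records one finished transition; increments the counter only if it was slow. */
    void record(String transition, Duration elapsed) {
        Duration threshold = thresholds.get(transition);
        if (threshold != null && elapsed.compareTo(threshold) > 0) {
            slowCounts.computeIfAbsent(transition, k -> new LongAdder()).increment();
        }
    }

    /** Current number of slow transitions seen for the given transition name. */
    long slowCount(String transition) {
        LongAdder adder = slowCounts.get(transition);
        return adder == null ? 0L : adder.sum();
    }
}
```

Because the counter only ever increases, an alert would be set on its increase
over a time window rather than on its absolute value, so a single slow upgrade
raises the alert once and does not keep it firing until the next transition.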

Cheers
Gyula

On Mon, Sep 29, 2025 at 4:34 PM Hemanga Borah via user <
[email protected]> wrote:

> Hello folks,
>
> TL;DR:
>
> 1) There are no metrics for end-to-end lifecycle time for application
> creation
>
> 2) Histogram metrics mask new points
>
> ----------------------------------------------------------------
>
> We are using Flink with the Apache Flink Kubernetes Operator at Datadog.
> We have been using the operator's metrics for monitoring and have been
> wondering about two issues.
>
> Context:
>
> The metrics emitted by the operator are defined here:
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/
>
> We are interested in monitoring the Lifecycle metrics, specifically for
> “Lifecycle.Transition time”. We want to monitor the transition times for
> new application creation and upgrades and alert on these if they exceed a
> certain threshold.
>
> We are facing these issues:
>
> 1. There are no metrics for end-to-end lifecycle time for application
> creation:
>
> If we want to monitor the end-to-end time taken by the operator, we need to
> monitor the transition time for two operations: a) new application creation
> and b) updates to existing applications.
>
> The time for operation "b" (updates) is the metric TRANSITION_UPGRADE,
> which measures the time from STABLE back to STABLE. But there is no metric
> for operation "a" (new application creation), which would measure the
> transition from CREATED to STABLE.
>
> 2. The metrics are emitted as a histogram, so we cannot use them for
> alerts:
>
> Since the metrics are a histogram, we can see values for min, max, avg,
> median, P95 and count.
>
> Suppose an Upgrade at time T0 takes a very long time, say 30 minutes. The
> value we see for this metric stays at 30 minutes until another Upgrade
> operation emits a new value at time T1, which could be much later.
> Additionally, if the new value at T1 is lower, say 15 minutes, and we are
> monitoring the max, it will be masked by the value from T0. Ideally, for
> the purposes of alerting, we would like to observe a single data point of
> 30 minutes for the event at T0, then either zeros or no data, and then
> another data point of 15 minutes at T1. Also, a new deployment clears out
> any histogram values because these are stored in the operator's memory.
>
> For problem 2, we have explored these solutions:
>
> a) Reduce the histogram to a single sample by setting
> metrics.histogram.sample.size to 1. While this narrows the context to
> just one data point, the metric still retains this value until another
> operation emits a metric. So, if we alert on this metric, the alert will
> not clear until there is another operation and a new value is emitted.
>
> b) We explored the idea of having a counter instead of a histogram. This
> will also not work because the metrics are “scrape” based and not
> “emission” based. So, the value of the old metric always stays. But, this
> solution may work if we always increment, and alert on delta values.
>
> c) We created a log parser to look at the debug logs from the metrics to
> extract the transition time from here:
> https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/metrics/lifecycle/ResourceLifecycleMetricTracker.java#L78.
> This has worked for us nicely. However, we believe there must be a simpler
> solution to get this metric.
>
> Configuration:
>
> Flink version: 1.20.2
>
> Flink Kubernetes Operator version: 1.12.1
>
> Java version: 17
>
>
