Hi Gyula,

Thanks for the suggestions!
Sounds good regarding opening a JIRA for the CREATED -> STABLE metric. I'll make the changes and contribute them to the open source project.

For the second point, I will likely add a counter metric that increments on every operation, so that we can alert on delta values (a rough sketch of what I have in mind is at the bottom of this mail, below the quoted thread). This will also require a code change in the OSS, so I will explore and test it before making any changes.

Thanks,
Hemanga

On Wed, Oct 1, 2025 at 9:05 AM Gyula Fóra <[email protected]> wrote:

> Hi Hemanga,
>
> 1. I think CREATED -> STABLE could be added, please open a JIRA for it, and you are also welcome to contribute the improvement :)
>
> 2. I don't know what the right solution here is, maybe a counter with a configurable threshold for "slow" upgrades: you would set a configurable threshold for different transitions, and the metric would be incremented whenever a transition is slower than that. Then you can set up alerting on the counter.
>
> Cheers
> Gyula
>
> On Mon, Sep 29, 2025 at 4:34 PM Hemanga Borah via user <[email protected]> wrote:
>
>> Hello folks,
>>
>> TL;DR:
>>
>> 1) There is no metric for the end-to-end lifecycle time of application creation.
>>
>> 2) Histogram metrics mask new data points.
>>
>> ----------------------------------------------------------------
>>
>> We are using Flink with the Apache Flink Kubernetes Operator at Datadog. We have been using the operator metrics for monitoring and have been wondering about two issues.
>>
>> Context:
>>
>> The metrics emitted by the operator are defined here: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/
>>
>> We are interested in monitoring the lifecycle metrics, specifically "Lifecycle.Transition time". We want to monitor the transition times for new application creation and for upgrades, and alert on them if they exceed a certain threshold.
>>
>> We are facing these issues:
>>
>> 1. There is no metric for the end-to-end lifecycle time of application creation:
>>
>> To monitor the end-to-end time taken by the operator, we need the transition time for two operations: a) new application creation and b) updates to existing applications.
>>
>> The time for operation "b" (updates) is covered by the TRANSITION_UPGRADE metric, which measures the time from STABLE back to STABLE. But there is no metric for operation "a" (new application creation), which would measure the transition from CREATED to STABLE.
>>
>> 2. The metrics are emitted as a histogram, so we cannot use them for alerts:
>>
>> Since the metrics are a histogram, we can see values for min, max, avg, median, P95 and count.
>>
>> Suppose there is a very high value for an upgrade at time T0, say 30 minutes. The value we see for this metric will remain 30 minutes until another upgrade operation emits a new value at time T1, which could be much later; so the metric stays at 30 minutes until T1. Additionally, if the new value at T1 is lower, say 15 minutes, and we are monitoring the max, it will be masked by the value from T0. Ideally, for alerting purposes, we would like to observe a single data point of 30 minutes for the event at T0, then either zeros or no metric at all, and then a single data point of 15 minutes at T1. Also, a new deployment of the operator clears out any histogram values, because they are stored in the operator's memory.
>>
>> For problem 2, we have explored these solutions:
>>
>> a) Reduce the histogram to a single item by setting metrics.histogram.sample.size to 1. While this narrows the window to just one data point, the metric still retains that value until another operation emits a new one. So, if we alert on this metric, the alert will not clear until another operation runs and a new value is emitted.
>>
>> b) We explored the idea of having a counter instead of a histogram. This also does not work as-is, because the metrics are "scrape" based and not "emission" based, so the value of the old metric always stays. However, this solution may work if we always increment and alert on delta values.
>>
>> c) We created a log parser that extracts the transition time from the debug logs emitted here: https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/metrics/lifecycle/ResourceLifecycleMetricTracker.java#L78. This has worked nicely for us, but we believe there must be a simpler way to get this metric.
>>
>> Configuration:
>>
>> Flink version: 1.20.2
>>
>> Flink Kubernetes Operator version: 1.12.1
>>
>> Java version: 17
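
P.S. To make the counter idea a bit more concrete, here is a rough, untested sketch that combines it with Gyula's threshold suggestion. It only uses the standard Flink metrics API (MetricGroup / Counter); the class name, the wiring and the threshold handling are placeholders I made up for illustration, not existing operator code. The real change would presumably hook into ResourceLifecycleMetricTracker, where the transition durations are already computed.

import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.MetricGroup;

/**
 * Sketch only: per-transition counters for lifecycle transitions.
 * "Count" is incremented on every completed transition so alerting can
 * watch per-interval deltas; "SlowCount" is incremented only when the
 * transition exceeded a configurable threshold (Gyula's suggestion).
 */
public class TransitionCounters {

    private final MetricGroup group;
    private final Duration slowThreshold; // would come from operator config in a real change
    private final Map<String, Counter> counts = new ConcurrentHashMap<>();
    private final Map<String, Counter> slowCounts = new ConcurrentHashMap<>();

    public TransitionCounters(MetricGroup lifecycleGroup, Duration slowThreshold) {
        this.group = lifecycleGroup;
        this.slowThreshold = slowThreshold;
    }

    /** Called once per completed transition, e.g. "Upgrade" or "Create". */
    public void onTransition(String transition, Duration elapsed) {
        counterFor(counts, transition, "Count").inc();
        if (elapsed.compareTo(slowThreshold) > 0) {
            counterFor(slowCounts, transition, "SlowCount").inc();
        }
    }

    private Counter counterFor(Map<String, Counter> cache, String transition, String suffix) {
        // Register lazily the first time a transition name is seen.
        return cache.computeIfAbsent(transition, t -> group.addGroup(t).counter(suffix));
    }
}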

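The wiring would then be something along these lines (operatorMetricGroup below is just a stand-in for whichever metric group the lifecycle tracker already has access to, and the 10-minute threshold is arbitrary):

    TransitionCounters counters =
            new TransitionCounters(operatorMetricGroup.addGroup("Lifecycle"), Duration.ofMinutes(10));
    counters.onTransition("Upgrade", Duration.ofMinutes(30)); // increments both Count and SlowCount

Since a counter only ever goes up while the operator is running, on the Datadog side we would alert on the per-interval delta rather than on the raw value, which avoids the masking problem described above.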