Hello folks,

TL;DR:
1) There is no metric for the end-to-end lifecycle time of new application creation.
2) Histogram metrics mask new data points, which makes them hard to alert on.

------------------------------------------------------------

We are using Flink with the Apache Flink Kubernetes Operator at Datadog. We monitor the operator with its built-in metrics and have run into two issues.

Context: the metrics emitted by the operator are documented here:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/

We are interested in the lifecycle metrics, specifically "Lifecycle.Transition time". We want to monitor the transition times for new application creation and for upgrades, and alert when they exceed a certain threshold.

The issues we are facing:

1. There is no metric for the end-to-end lifecycle time of new application creation.

To monitor the end-to-end time taken by the operator, we need the transition time for two operations: a) creation of a new application and b) upgrades of an existing application. Operation b) is covered by the TRANSITION_UPGRADE metric, which measures the time from STABLE back to STABLE. But there is no metric for operation a), which would measure the transition from CREATE to STABLE.

2. The metrics are emitted as a histogram, so we cannot use them for alerts.

Because the metric is a histogram, we can see values for min, max, avg, median, p95 and count. Suppose an Upgrade at time T0 has a very high transition time, say 30 minutes. The metric will keep reporting 30 minutes until another Upgrade operation emits a new value at time T1, which could be much later. Additionally, if the new value at T1 is lower, say 15 minutes, and we are monitoring the max, it is masked by the value from T0. Ideally, for alerting, we would like to see a single data point of 30 minutes for the event at T0, then either zeros or no data at all, and then a single data point of 15 minutes at T1. Also, a new deployment of the operator clears out the histogram values, because they are only kept in the operator's memory.

For problem 2, we have explored these solutions:

a) Reduce the histogram to a single sample by setting metrics.histogram.sample.size to 1. This narrows the window to one data point, but the metric still retains that value until another operation emits a new one. So an alert on this metric will not recover until another operation runs and a new value is reported.

b) Use a counter instead of a histogram. This does not work either, because the metrics are scrape-based rather than emission-based, so the last value of the old metric is always reported. It could work if we always increment and alert on delta values.

c) Parse the operator's debug logs to extract the transition time logged here:
https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/metrics/lifecycle/ResourceLifecycleMetricTracker.java#L78
This has worked nicely for us (a rough sketch of the idea is at the end of this message). However, we believe there must be a simpler way to get this metric.

Configuration:
Flink version: 1.20.2
Flink Kubernetes Operator version: 1.12.1
Java version: 17
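For reference, here is a simplified sketch of the log-parsing idea in c). It is not our exact implementation: the regex and the metric name (flink.operator.lifecycle.transition_time) are placeholders, and the pattern has to be adjusted to whatever debug line ResourceLifecycleMetricTracker emits in your operator version. The sketch reads operator log lines from stdin and prints one DogStatsD-style gauge point per transition event.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TransitionTimeLogParser {

        // Placeholder pattern -- adjust to the actual debug line logged by
        // ResourceLifecycleMetricTracker in your operator version.
        private static final Pattern TRANSITION_LINE =
            Pattern.compile("resource=(\\S+).*transition=(\\S+).*timeSeconds=(\\d+(?:\\.\\d+)?)");

        public static void main(String[] args) throws Exception {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
                String line;
                while ((line = in.readLine()) != null) {
                    Matcher m = TRANSITION_LINE.matcher(line);
                    if (!m.find()) {
                        continue; // not a transition line
                    }
                    String resource = m.group(1);
                    String transition = m.group(2);              // e.g. UPGRADE
                    double seconds = Double.parseDouble(m.group(3));

                    // Emit exactly one point per transition event, here as a
                    // DogStatsD-style gauge line on stdout; replace with whatever
                    // shipping mechanism you use.
                    System.out.printf(
                        "flink.operator.lifecycle.transition_time:%.1f|g|#resource:%s,transition:%s%n",
                        seconds, resource, transition);
                }
            }
        }
    }

The important property is that each transition produces exactly one data point, so an alert can evaluate it directly without the masking behaviour described in issue 2.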
