Hello folks,

TL;DR:

1) There are no metrics for end-to-end lifecycle time for application
creation

2) Histogram metrics mask new data points

----------------------------------------------------------------------

We run Flink with the Apache Flink Kubernetes Operator at Datadog. We use
the operator's metrics for monitoring and have run into two issues.

Context:

The metrics emitted by the operator are defined here:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/

We are interested in monitoring the Lifecycle metrics, specifically for
“Lifecycle.Transition time”. We want to monitor the transition times for
new application creation and upgrades and alert on these if they exceed a
certain threshold.

We are facing these issues:

1. There are no metrics for end-to-end lifecycle time for application
creation:

If we want to monitor the end-to-end time taken by the operator, we need to
monitor the transition time for two operations: a) new application creation
and b) updates to existing applications.

The time for operation "b" (updates) is covered by the metric
TRANSITION_UPGRADE, which measures the time from STABLE back to STABLE. But
there is no metric for operation "a" (new application creation), which
would measure the transition from CREATE to STABLE.

2. The metrics are emitted as a histogram, so we cannot use them for alerts:

Since the metrics are a histogram, we can see values for min, max, avg,
median, P95 and count.

Suppose an Upgrade at time T0 takes a very long time, say 30 minutes. The
metric keeps reporting 30 minutes until another Upgrade operation emits a
new value at time T1, which could be much later. So, the metric stays at 30
minutes until T1. Additionally, if the new value at T1 is lower, say 15
minutes, and we are monitoring the max, it will be masked by the value from
T0. Ideally, for the purposes of alerting, we would like to see a single
data point of 30 minutes for the event at T0, then either zeros or no data,
and then another data point of 15 minutes at T1. Also, a new deployment
clears out any histogram values because these are stored in the memory of
the operator.
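
To make the masking concrete, here is a minimal Python sketch of what a
scraper sees. It is only a model, not the operator's code: it assumes the
histogram behaves like a fixed-size window of the most recent samples, and
the class and function names (SlidingWindowHistogram, scrape_series) and
the timestamps are made up for illustration.

```python
from collections import deque


class SlidingWindowHistogram:
    """Toy model (assumption): keeps only the last `sample_size` samples."""

    def __init__(self, sample_size):
        self._samples = deque(maxlen=sample_size)

    def update(self, value):
        self._samples.append(value)

    def max(self):
        # None models "no data yet", e.g. right after an operator restart.
        return max(self._samples) if self._samples else None


def scrape_series(events, scrape_times, sample_size):
    """Return the (time, max) pairs a scraper would observe.

    events: list of (time, transition_minutes) tuples.
    scrape_times: increasing list of scrape timestamps.
    """
    hist = SlidingWindowHistogram(sample_size)
    pending = deque(sorted(events))
    series = []
    for t in scrape_times:
        # Record every transition that finished before this scrape.
        while pending and pending[0][0] <= t:
            hist.update(pending.popleft()[1])
        series.append((t, hist.max()))
    return series


# Upgrade of 30 minutes at T0=10, upgrade of 15 minutes at T1=50.
events = [(10, 30), (50, 15)]
scrapes = [0, 20, 40, 60, 80]

masked = scrape_series(events, scrapes, sample_size=100)
single = scrape_series(events, scrapes, sample_size=1)

print(masked)  # max stays at 30 even after the 15-minute upgrade
print(single)  # last value only, but still repeated on every scrape
```

With the larger window, the max never drops below 30 after T0. With a
window of one sample, the 15-minute value becomes visible at T1, but each
value is still reported on every scrape until the next transition.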

For problem 2, we have explored these solutions:

a) Reduce the histogram to a single sample by setting
metrics.histogram.sample.size to 1. While this narrows the context to just
one data point, the metric still retains that value until another operation
emits a new one. So, if we alert on this metric, the alert will not clear
until there is another operation and a new value is emitted.

b) We explored the idea of having a counter instead of a histogram. On its
own this will not work either, because the metrics are "scrape" based and
not "emission" based, so the old value always stays visible. However, this
could work if the counter only ever increments and we alert on the deltas
between scrapes.

c) We created a log parser that extracts the transition time from the debug
logs emitted here:
https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/metrics/lifecycle/ResourceLifecycleMetricTracker.java#L78.
This has worked nicely for us. However, we believe there must be a simpler
way to get this metric.
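
For option (b), the delta idea can be sketched roughly as follows. This
assumes two hypothetical cumulative counters, one for total transition
seconds and one for the number of transitions; neither exists in the
operator today as far as we know, and the function name interval_averages
is ours. Alerting on the per-interval value gives one data point per event
and no data in between.

```python
def interval_averages(scrapes):
    """Turn cumulative (total_seconds, total_count) scrapes into
    per-interval average transition times.

    Returns one entry per pair of consecutive scrapes: the average
    transition time of the transitions that completed in that interval,
    or None if no transition completed.
    """
    out = []
    for (prev_sum, prev_cnt), (cur_sum, cur_cnt) in zip(scrapes, scrapes[1:]):
        d_sum = cur_sum - prev_sum
        d_cnt = cur_cnt - prev_cnt
        if d_sum < 0 or d_cnt < 0:
            # Counter reset (e.g. operator restart): treat the current
            # values as the delta since the reset.
            d_sum, d_cnt = cur_sum, cur_cnt
        out.append(d_sum / d_cnt if d_cnt > 0 else None)
    return out


# A 30-minute upgrade, then a quiet scrape, then a 15-minute upgrade.
scrapes = [(0, 0), (1800, 1), (1800, 1), (2700, 2), (2700, 2)]
averages = interval_averages(scrapes)
print(averages)  # [1800.0, None, 900.0, None]
```

If several transitions finish within one scrape interval, this sketch only
yields their average, so a single slow transition could still be diluted.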

Configuration:

Flink version: 1.20.2

Flink Kubernetes Operator version: 1.12.1

Java version: 17
