Philippe Gref-Viau created FLINK-36932:
------------------------------------------

             Summary: Add resource-level metrics for different status/states to flink-kubernetes-operator
                 Key: FLINK-36932
                 URL: https://issues.apache.org/jira/browse/FLINK-36932
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator, Runtime / Metrics
            Reporter: Philippe Gref-Viau


Operator-specific metrics were introduced as part of FLINK-26953. These metrics are useful from a high-level reporting point of view (e.g. X FlinkDeployments are in state Y across the namespace), but they give no insight into the states/statuses of _individual_ (i.e. resource-level) deployments. For example, there is currently no good signal to indicate whether a particular deployment is in a given lifecycle state.

As part of our daily operational routine, we have found this lack of resource-level metrics painful, since we cannot create graphs or alerts that show the names of failing deployments. We can always turn to the metrics emitted by Flink itself (e.g. the {{<jobStatus>State}} Gauge metric available on the JobManager), which are "faceted" by the job/deployment name, but in some cases a problem can occur before the jobs ever get to run and/or before their metrics get a chance to be emitted. In addition, not all statuses/states are covered by those metrics (i.e. lifecycle states).

Furthermore, the current set of metrics emitted for FlinkDeployments includes namespace-level counts for each Job Manager state, but it does not include count metrics for each Job status. Again, we can turn to metrics emitted directly by Flink itself, but we run into the limitations I mentioned above.

As such, we propose the following changes:
 * Extending all of the existing "counter-based" metrics related to status/state, so that each status/state also has a resource-level, "gauge-based" metric that tracks whether each deployment (or the related sub-resource, i.e. job/job manager) is in a given status/state
 * Adding metrics to track the total count of Jobs in each status (by namespace), and a gauge-based metric for each Job status (by deployment)

Another way to present the suggested changes is to show what new items would be added to the "Flink Resource Metrics" table shown on [this|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#flink-resource-metrics] page:
||Scope||Metrics||Description||Type||
|Resource|FlinkDeployment.JmDeploymentStatus.<Status>.InStatus|For a given Job Manager deployment status <Status>, return 1 if the Job Manager associated with the FlinkDeployment is currently in that status, otherwise return 0. <Status> can take values from: READY, DEPLOYED_NOT_READY, DEPLOYING, MISSING, ERROR|Gauge|
|Resource|FlinkDeployment.JobStatus.<Status>.InStatus|For a given job status <Status>, return 1 if the job associated with the FlinkDeployment is currently in that status, otherwise return 0. <Status> can take values from: CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED, INITIALIZING, RECONCILING, RESTARTING, RUNNING, SUSPENDED|Gauge|
|Namespace|FlinkDeployment.JobStatus.<Status>.Count|Number of managed FlinkDeployment resources per <Status> per namespace. <Status> can take values from: CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED, INITIALIZING, RECONCILING, RESTARTING, RUNNING, SUSPENDED|Gauge|
|Resource|FlinkDeployment/FlinkSessionJob.Lifecycle.State.<State>.InState|For a given lifecycle state <State>, return 1 if the managed resource is currently in that state, otherwise return 0. <State> can take values from: CREATED, SUSPENDED, UPGRADING, DEPLOYED, STABLE, ROLLING_BACK, ROLLED_BACK, FAILED|Gauge|

We've actually already implemented these changes in our fork of the flink-kubernetes-operator codebase, and it has been working well.
At this point, we're interested in merging the changes back into the main branch to avoid diverging from the releases, share the improvement with the rest of the community, and get some feedback on our implementation.
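
For illustration only, the 1/0 gauge semantics described in the table can be sketched in a few lines of plain Java. This is not the code from the fork mentioned above, and it does not use the operator's or Flink's metrics classes: the class name {{ResourceStatusGauges}}, the helper {{gaugesFor}}, and the use of {{Supplier<Integer>}} as a stand-in for a metrics gauge are all assumptions made for the example. Only the metric name pattern and the list of lifecycle states are taken from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical, dependency-free sketch of per-resource "in state" gauges. */
public class ResourceStatusGauges {

    /** Lifecycle states as listed in the proposed table. */
    public enum LifecycleState {
        CREATED, SUSPENDED, UPGRADING, DEPLOYED, STABLE, ROLLING_BACK, ROLLED_BACK, FAILED
    }

    /**
     * Builds one gauge per enum constant, named prefix.CONSTANT.suffix.
     * Each gauge reads 1 while the resource is in that state and 0 otherwise.
     * Supplier<Integer> is only a stand-in for a real metrics gauge type.
     */
    public static <E extends Enum<E>> Map<String, Supplier<Integer>> gaugesFor(
            String prefix, String suffix, Class<E> type, Supplier<E> current) {
        Map<String, Supplier<Integer>> gauges = new LinkedHashMap<>();
        for (E state : type.getEnumConstants()) {
            // The value is computed lazily, so it follows later state changes.
            gauges.put(prefix + "." + state.name() + "." + suffix,
                    () -> current.get() == state ? 1 : 0);
        }
        return gauges;
    }

    public static void main(String[] args) {
        // Stand-in for the state the operator would track for one deployment.
        AtomicReference<LifecycleState> state =
                new AtomicReference<>(LifecycleState.DEPLOYED);

        Map<String, Supplier<Integer>> gauges = gaugesFor(
                "FlinkDeployment.Lifecycle.State", "InState",
                LifecycleState.class, state::get);

        // Simulate the deployment moving to FAILED, then read every gauge.
        state.set(LifecycleState.FAILED);
        gauges.forEach((name, gauge) -> System.out.println(name + " = " + gauge.get()));
    }
}
```

After the simulated transition, only {{FlinkDeployment.Lifecycle.State.FAILED.InState}} reads 1 and the other seven gauges read 0. An alert keyed on that single gauge, tagged with the resource name, would identify the failing deployment directly.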