Philippe Gref-Viau created FLINK-36932:
------------------------------------------

             Summary: Add resource-level metrics for different status/states to 
flink-kubernetes-operator
                 Key: FLINK-36932
                 URL: https://issues.apache.org/jira/browse/FLINK-36932
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator, Runtime / Metrics
            Reporter: Philippe Gref-Viau


Operator-specific metrics were introduced as part of FLINK-26953. These metrics 
are useful from a high-level reporting point of view (e.g. X many 
FlinkDeployments are in state Y across the namespace), but they give no 
insight into the states/statuses of _individual_ (i.e. resource-level) 
deployments. For example, there's currently no good signal to indicate whether a 
particular deployment is in a given lifecycle state.

As part of our daily operational routine, we have found this lack of 
resource-level metrics painful, since we cannot create graphs or alerts that 
show the names of failing deployments. We can always turn to the metrics emitted 
by Flink itself (e.g. the {{<jobStatus>State}} Gauge metric available on the 
JobManager), which are "faceted" by the job/deployment name, but in some cases a 
problem can occur before the jobs ever get to run and/or before their metrics 
even get a chance to be emitted. In addition, not all statuses/states are 
covered by those metrics (e.g. lifecycle states).

Furthermore, the current set of metrics emitted for FlinkDeployments includes 
namespace-level counts for each Job Manager state, but it does not include 
count metrics for each Job status. Again, we can turn to metrics emitted 
directly by Flink itself, but we run into the limitations mentioned above.

As such, we propose the following changes:
 * Extending all of the existing "counter-based" metrics related to 
status/state, so that each status/state also has a resource-level, 
"gauge-based" metric that tracks whether each deployment (or the related 
sub-resource, i.e. job/job manager) is in a given status/state
 * Adding metrics to track the total count of Jobs in each status (by 
namespace), and a gauge-based metric for each Job status (by deployment)
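
The first point can be illustrated with a minimal, hypothetical sketch (this is 
not the code from our fork). The class and method names are made up for 
illustration, and a plain {{Supplier<Integer>}} stands in for a Flink Gauge so 
the example has no dependencies. One gauge is created per status, and each one 
reads the current status lazily:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative only: resource-level "InStatus" gauges for one FlinkDeployment. */
class ResourceStatusGauges {

    /** Job Manager deployment statuses, as listed in the proposed metrics table. */
    enum JmDeploymentStatus { READY, DEPLOYED_NOT_READY, DEPLOYING, MISSING, ERROR }

    /**
     * Builds one gauge per status. Each gauge evaluates the supplied current
     * status when it is read, so it returns 1 for the status the resource is
     * in at that moment and 0 for every other status.
     */
    static Map<String, Supplier<Integer>> inStatusGauges(
            Supplier<JmDeploymentStatus> currentStatus) {
        Map<String, Supplier<Integer>> gauges = new LinkedHashMap<>();
        for (JmDeploymentStatus status : JmDeploymentStatus.values()) {
            // Metric name follows the pattern proposed in the table.
            String name = "FlinkDeployment.JmDeploymentStatus." + status + ".InStatus";
            gauges.put(name, () -> currentStatus.get() == status ? 1 : 0);
        }
        return gauges;
    }
}
```

Because every status always has a gauge (reporting 0 or 1), a dashboard or 
alert can filter on the deployment name and a single status without having to 
deal with missing series.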

Another way to present the suggested changes is to show what new items would be 
added in the "Flink Resource Metrics" table shown on 
[this|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#flink-resource-metrics]
 page:
||Scope||Metrics||Description||Type||
|Resource|FlinkDeployment.JmDeploymentStatus.<Status>.InStatus|For a given Job 
Manager deployment status <Status>, return 1 if the Job Manager associated with 
the FlinkDeployment is currently in that status, otherwise return 0. <Status> 
can take values from: READY, DEPLOYED_NOT_READY, DEPLOYING, MISSING, 
ERROR|Gauge|
|Resource|FlinkDeployment.JobStatus.<Status>.InStatus|For a given job status 
<Status>, return 1 if the job associated with the FlinkDeployment is currently 
in that status, otherwise return 0. <Status> can take values from: CANCELED, 
CANCELLING, CREATED, FAILED, FAILING, FINISHED, INITIALIZING, RECONCILING, 
RESTARTING, RUNNING, SUSPENDED|Gauge|
|Namespace|FlinkDeployment.JobStatus.<Status>.Count|Number of managed 
FlinkDeployment resources per <Status> per namespace. <Status> can take values 
from: CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED, INITIALIZING, 
RECONCILING, RESTARTING, RUNNING, SUSPENDED|Gauge|
|Resource|FlinkDeployment/FlinkSessionJob.Lifecycle.State.<State>.InState|For a 
given lifecycle state <State>, return 1 if the managed resource is currently in 
that state, otherwise return 0. <State> can take values from: CREATED, 
SUSPENDED, UPGRADING, DEPLOYED, STABLE, ROLLING_BACK, ROLLED_BACK, FAILED|Gauge|
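
As a rough illustration of the namespace-level {{JobStatus.<Status>.Count}} 
row, the sketch below tallies deployments per job status per namespace. It is 
hypothetical: {{DeploymentSnapshot}} and {{countByNamespace}} are invented for 
this example and are not operator APIs, and the real operator would update 
such counts as resources are reconciled rather than from a list.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: per-namespace counts of FlinkDeployments by job status. */
class JobStatusCounts {

    /** Job statuses, as listed in the proposed metrics table. */
    enum JobStatus {
        CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED,
        INITIALIZING, RECONCILING, RESTARTING, RUNNING, SUSPENDED
    }

    /** Hypothetical view of one FlinkDeployment: namespace, name, job status. */
    static final class DeploymentSnapshot {
        final String namespace;
        final String name;
        final JobStatus jobStatus;

        DeploymentSnapshot(String namespace, String name, JobStatus jobStatus) {
            this.namespace = namespace;
            this.name = name;
            this.jobStatus = jobStatus;
        }
    }

    /** Returns namespace -> (status -> count); every status is present, starting at 0. */
    static Map<String, Map<JobStatus, Integer>> countByNamespace(
            List<DeploymentSnapshot> deployments) {
        Map<String, Map<JobStatus, Integer>> counts = new TreeMap<>();
        for (DeploymentSnapshot d : deployments) {
            Map<JobStatus, Integer> perStatus =
                    counts.computeIfAbsent(d.namespace, ns -> zeroedCounts());
            perStatus.merge(d.jobStatus, 1, Integer::sum);
        }
        return counts;
    }

    /** Pre-populates every status with 0 so that no series is ever missing. */
    private static Map<JobStatus, Integer> zeroedCounts() {
        Map<JobStatus, Integer> zeroed = new EnumMap<>(JobStatus.class);
        for (JobStatus status : JobStatus.values()) {
            zeroed.put(status, 0);
        }
        return zeroed;
    }
}
```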

 

We've actually already implemented these changes in our fork of the 
flink-kubernetes-operator codebase, and it's been working pretty well. At this 
point, we're interested in merging the changes back into the main branch to 
avoid diverging from the releases, share the improvement with the rest of the 
community, and get some feedback on our implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
