Hi Andrew!

I think you are completely right, this is a bug. The per-namespace metrics do not seem to filter by namespace; each one shows the aggregated global count instead.

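To make the suspected behavior concrete, here is a minimal Python sketch. It is not the operator's actual code: the function names and the in-memory list of deployments are made up for illustration, using the two namespaces from the report below. It contrasts a count that ignores the namespace with one that is grouped by namespace.

```python
from collections import Counter

# Hypothetical stand-in for the resources the operator tracks:
# (namespace, lifecycle state) pairs taken from the report in this thread.
deployments = [
    ("stream-enrichment-poc", "STABLE"),
    ("rdf-streaming-updater", "STABLE"),
]


def global_stable_count(resources):
    """Count every STABLE resource, ignoring its namespace (suspected behavior)."""
    return sum(1 for _, state in resources if state == "STABLE")


def stable_count_by_namespace(resources):
    """Count STABLE resources separately for each namespace (expected behavior)."""
    return Counter(ns for ns, state in resources if state == "STABLE")


# Suspected: every namespace gauge reports the same global total.
suspected = {ns: global_stable_count(deployments) for ns, _ in deployments}

# Expected: each namespace gauge reports only its own resources.
expected = dict(stable_count_by_namespace(deployments))

print("suspected:", suspected)
print("expected: ", expected)
```

With both deployments STABLE, the unfiltered version gives 2 for each namespace, which matches the first report; once one deployment leaves STABLE, both namespaces drop to 1 together, which matches the follow-up.
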
I opened a ticket: https://issues.apache.org/jira/browse/FLINK-32164

Thanks for reporting this!
Gyula

On Mon, May 22, 2023 at 10:49 PM Andrew Otto <o...@wikimedia.org> wrote:

> Also! I do have 2 FlinkDeployments deployed with this operator, but they
> are in different namespaces, and each of the per namespace metrics reports
> that it has 2 Deployments in them, even though there is only one according
> to kubectl.
>
> Actually...we just tried to deploy a change (enabling some checkpointing)
> that caused one of our FlinkDeployments to fail. Now, both namespace
> STABLE_Counts each report 1.
>
> # curl -s <pod_ip>:<prom_port> | grep flink_k8soperator_namespace_Lifecycle_State_STABLE_Count
> flink_k8soperator_namespace_Lifecycle_State_STABLE_Count{resourcetype="FlinkDeployment",resourcens="stream_enrichment_poc",name="flink_kubernetes_operator",host="flink_kubernetes_operator_86b888d6b6_gbrt4",namespace="flink_operator",} 1.0
> flink_k8soperator_namespace_Lifecycle_State_STABLE_Count{resourcetype="FlinkDeployment",resourcens="rdf_streaming_updater",name="flink_kubernetes_operator",host="flink_kubernetes_operator_86b888d6b6_gbrt4",namespace="flink_operator",} 1.0
>
> It looks like maybe this metric is not reporting per namespace, but a
> global count.
>
> On Mon, May 22, 2023 at 2:56 PM Andrew Otto <o...@wikimedia.org> wrote:
>
>> Oh, FWIW, I do have operator HA enabled with 2 replicas running, but in
>> my examples there, I am curl-ing the leader flink operator pod.
>>
>> On Mon, May 22, 2023 at 2:47 PM Andrew Otto <o...@wikimedia.org> wrote:
>>
>>> Hello!
>>>
>>> I'm doing some grafana+prometheus dashboarding for
>>> flink-kubernetes-operator. Reading metrics docs
>>> <https://stackoverflow.com/a/61795256>, I see that I have nice per k8s
>>> namespace lifecycle current count gauge metrics in Prometheus.
>>>
>>> Using kubectl, I can see that I have one FlinkDeployment in my namespace:
>>>
>>> # kubectl -n stream-enrichment-poc get flinkdeployments
>>> NAME             JOB STATUS   LIFECYCLE STATE
>>> flink-app-main   RUNNING      STABLE
>>>
>>> But, prometheus is reporting that I have 2 FlinkDeployments in the
>>> STABLE state.
>>>
>>> # curl -s <pod_ip>:<prom_port> | grep flink_k8soperator_namespace_Lifecycle_State_STABLE_Count
>>> flink_k8soperator_namespace_Lifecycle_State_STABLE_Count{resourcetype="FlinkDeployment",resourcens="stream_enrichment_poc",name="flink_kubernetes_operator",host="flink_kubernetes_operator_86b888d6b6_gbrt4",namespace="flink_operator",} 2.0
>>>
>>> I'm not sure why I see 2.0 reported.
>>> flink_k8soperator_namespace_JmDeploymentStatus_READY_Count reports only
>>> one FlinkDeployment.
>>>
>>> # curl <pod_ip>:<prom_port>/metrics | grep flink_k8soperator_namespace_JmDeploymentStatus_READY_Count
>>> flink_k8soperator_namespace_JmDeploymentStatus_READY_Count{resourcetype="FlinkDeployment",resourcens="stream_enrichment_poc",name="flink_kubernetes_operator",host="flink_kubernetes_operator_86b888d6b6_gbrt4",namespace="flink_operator",} 1.0
>>>
>>> Is it possible that
>>> flink_k8soperator_namespace_Lifecycle_State_STABLE_Count is being
>>> reported as an incrementing counter instead of a gauge?
>>>
>>> Thanks
>>> -Andrew Otto
>>> Wikimedia Foundation