Hi,

I have a few Flink jobs running on Kubernetes using the Flink Kubernetes Operator. By following the documentation [1], I was able to set up monitoring for the Operator itself. For the jobs themselves, however, I'm unsure how to set it up properly. Here's my FlinkDeployment configuration:
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: sample-job
  namespace: flink
spec:
  image: flink:1.17
  flinkVersion: v1_17
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "1"
    state.savepoints.dir: file:///flink-data/savepoints
    state.checkpoints.dir: file:///flink-data/checkpoints
    high-availability.type: kubernetes
    high-availability.storageDir: file:///flink-data/ha
    metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
    metrics.reporter.prom.port: 9249-9250
  serviceAccount: flink
  jobManager:
    resource:
      memory: "1024m"
      cpu: 1
  taskManager:
    resource:
      memory: "1024m"
      cpu: 1
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          volumeMounts:
            - mountPath: /flink-data
              name: flink-volume
      volumes:
        - name: flink-volume
          emptyDir: {}
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 1
    upgradeMode: savepoint
    state: running
    savepointTriggerNonce: 0

When I exec into the pod, I can curl http://localhost:9249 and see the JobManager metrics. However, the TaskManager metrics aren't there, and nothing is listening on port 9250. Both the JobManager and the TaskManager are running on the same machine. Since there's no instruction on how to scrape the jobs, I tried adapting the PodMonitor config provided for the Operator, but that didn't work: the target shows up as registered in the Prometheus dashboard, but it always stays completely blank. Here's the config I used:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: sample-job
  namespace: monitoring
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: sample-job
  namespaceSelector:
    matchNames:
      - flink
  podMetricsEndpoints:
    - targetPort: 9249

So, here's what I want to know:

1. What should the appropriate scraping configuration look like?
2. How can I retrieve the TaskManager metrics as well?
3.
In the case where I have multiple jobs potentially running on the same machine, how can I get metrics for all of them?

Any help would be appreciated.

Versions:
Flink: 1.17.1
Flink Kubernetes Operator: 1.5.0

[1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.5/docs/operations/metrics-logging/#how-to-enable-prometheus-example

Thanks,
Sunny
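P.S. To make question 1 more concrete: the following is the shape of configuration I was imagining, with one endpoint per port in the reporter's 9249-9250 range. This is purely my guess at what the answer might look like, not something I have gotten to work:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: sample-job
  namespace: monitoring
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: sample-job
  namespaceSelector:
    matchNames:
      - flink
  podMetricsEndpoints:
    # My assumption: one endpoint per port in metrics.reporter.prom.port: 9249-9250
    - targetPort: 9249
    - targetPort: 9250
```

Please correct me if this is the wrong approach entirely.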