Hi,

I have a few Flink jobs running on Kubernetes using the Flink Kubernetes
Operator. By following the documentation [1] I was able to set up
monitoring for the Operator itself. For the jobs themselves, though, I'm
confused about how to set up scraping properly. Here's my FlinkDeployment
configuration:

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: sample-job
  namespace: flink
spec:
  image: flink:1.17
  flinkVersion: v1_17
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "1"
    state.savepoints.dir: file:///flink-data/savepoints
    state.checkpoints.dir: file:///flink-data/checkpoints
    high-availability.type: kubernetes
    high-availability.storageDir: file:///flink-data/ha
    metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
    metrics.reporter.prom.port: 9249-9250
  serviceAccount: flink
  jobManager:
    resource:
      memory: "1024m"
      cpu: 1
  taskManager:
    resource:
      memory: "1024m"
      cpu: 1
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          volumeMounts:
          - mountPath: /flink-data
            name: flink-volume
      volumes:
      - name: flink-volume
        emptyDir: {}
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 1
    upgradeMode: savepoint
    state: running
    savepointTriggerNonce: 0
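
One thing I wasn't sure about: whether the metrics port needs to be declared on the main container at all for a PodMonitor's numeric targetPort to resolve. I considered adding something like this to the podTemplate (the `metrics` port name is just my own choice, and I haven't confirmed this is actually required):

```yaml
# Sketch: declare the Prometheus reporter port on the main container,
# so a PodMonitor could match it by number or by name.
podTemplate:
  spec:
    containers:
      - name: flink-main-container
        ports:
          - name: metrics        # arbitrary name, my own invention
            containerPort: 9249
        volumeMounts:
          - mountPath: /flink-data
            name: flink-volume
```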

When I exec into the JobManager pod, I can curl http://localhost:9249 and
see the JobManager metrics. The TaskManager metrics aren't there, though,
and nothing is listening on port 9250, even though the JobManager and
TaskManager are running on the same machine.

The docs don't give any instructions on how to scrape the jobs themselves,
so I tried modifying the PodMonitor config provided for the Operator and
applying it, but that didn't work: the target shows up as registered in the
Prometheus dashboard, but it always stays completely blank. Here's the
config I used:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: sample-job
  namespace: monitoring
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: sample-job
  namespaceSelector:
    matchNames:
    - flink
  podMetricsEndpoints:
    - targetPort: 9249
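
In case it clarifies what I've been poking at, here's the variant I was going to try next. I'm assuming the `type: flink-native-kubernetes` label is set on both the JobManager and TaskManager pods (that's what I gathered from looking at the pod labels), but I haven't verified the rest:

```yaml
# Sketch: select both JobManager and TaskManager pods via a shared label
# and scrape the whole reporter port range, not just 9249.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: sample-job
  namespace: monitoring
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      type: flink-native-kubernetes   # assumption: present on JM and TM pods
  namespaceSelector:
    matchNames:
      - flink
  podMetricsEndpoints:
    - targetPort: 9249
    - targetPort: 9250
```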

So, here's what I want to know:
1. What should the appropriate scraping configuration look like?
2. How can I retrieve the TaskManager metrics as well?
3. If multiple jobs end up running on the same machine, how can I get
metrics for all of them?

Any help would be appreciated.

Versions:
Flink: 1.17.1
Flink Kubernetes Operator: 1.5.0

[1]
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.5/docs/operations/metrics-logging/#how-to-enable-prometheus-example


Thanks,
Sunny
