Re: [External] metric collision using datadog and standalone Kubernetes HA mode

2021-10-20 Thread Chesnay Schepler
What version are you using, and if you are using 1.13+, are you using the adaptive scheduler or reactive mode? On 20/10/2021 07:39, Clemens Valiente wrote: Hi Chesnay, thanks a lot for the clarification. We managed to resolve the collision, and isolated the problem to the metrics themselves. …
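For context, the adaptive scheduler and reactive mode that the question refers to are both switched on through flink-conf.yaml. A minimal sketch, assuming Flink 1.13+ and a standalone deployment (reactive mode is only supported there):

    # Adaptive scheduler (Flink 1.13+)
    jobmanager.scheduler: adaptive

    # Reactive mode (standalone deployments only; implies the adaptive scheduler)
    scheduler-mode: reactive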

Re: [External] metric collision using datadog and standalone Kubernetes HA mode

2021-10-19 Thread Clemens Valiente
Hi Chesnay, thanks a lot for the clarification. We managed to resolve the collision, and isolated the problem to the metrics themselves. Using the REST API at /jobs/<jobid>/metrics?get=uptime, the response is [{"id":"uptime","value":"-1"}] despite the job running and processing data for 5 days at that point. …
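A minimal sketch of the query described above, using only the Python standard library; the JobManager address and job id below are placeholders, not values from the thread:

    import json
    import urllib.request

    # Placeholders: point these at your own JobManager REST endpoint and job id.
    JOBMANAGER = "http://localhost:8081"
    JOB_ID = "0123456789abcdef0123456789abcdef"

    # GET /jobs/<jobid>/metrics?get=uptime
    url = f"{JOBMANAGER}/jobs/{JOB_ID}/metrics?get=uptime"
    with urllib.request.urlopen(url) as resp:
        metrics = json.load(resp)

    # A healthy, running job should report a growing uptime in milliseconds;
    # the thread instead reports [{"id":"uptime","value":"-1"}].
    for m in metrics:
        print(m["id"], "=", m["value"])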

Re: [External] metric collision using datadog and standalone Kubernetes HA mode

2021-10-14 Thread Chesnay Schepler
I think you are misunderstanding a few things. a) when you include a variable in the scope format, then Flink fills that in /before/ it reaches Datadog. If you set it to "flink.<job_name>", then what we send to Datadog is "flink.myAwesomeJob". b) the exception you see is not coming from Datadog. They occur …
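To illustrate point (a): Flink expands scope variables itself, so the Datadog reporter only ever sees the finished metric name. A sketch of the relevant flink-conf.yaml, reusing the hypothetical job name from the reply:

    # Flink substitutes <job_name> before the metric name reaches the reporter,
    # so Datadog receives the already-expanded identifier.
    metrics.scope.jm.job: flink.<job_name>

    # For a job named "myAwesomeJob", a job-level metric arrives at Datadog as
    #   flink.myAwesomeJob.<metric-name>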

[External] metric collision using datadog and standalone Kubernetes HA mode

2021-10-12 Thread Clemens Valiente
Hi, we are using Datadog as our metrics reporter as documented here: https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/metric_reporters/#datadog Our jobmanager scope is

    metrics.scope.jm: flink.jobmanager
    metrics.scope.jm.job: flink.jobmanager

since Datadog doesn't allow …
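For comparison, the default JobManager scope formats in Flink keep the variables that make each metric identifier unique; a sketch, with the defaults as listed in the Flink metrics documentation:

    # Flink defaults:
    metrics.scope.jm:     <host>.jobmanager
    metrics.scope.jm.job: <host>.jobmanager.<job_name>

    # Hard-coding both scopes to the literal "flink.jobmanager", as above, leaves
    # no variable to distinguish the JobManager's own metrics from each job's
    # metrics, so different metrics map to the same name; that matches the
    # collision in the subject line.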