Re: DataDog and Flink

2021-03-24 Thread Vishal Santoshi
Yep, there is not a single EP that does the whole dump, but something like this works (dirty, but who cares :)). The vertex metrics are the most numerous anyway. curl -s http:///jobs/[job_id] | jq -r '.vertices' | jq '.[].id' | xargs -I {} curl http://xx/jobs/[job_id]/vertices/{}/metrics | jq
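The same walk over the vertices can be sketched in Python. This is only a rough equivalent of the one-liner, assuming the job details response carries a `vertices` array whose entries have an `id` field (which is what the `jq` filters above rely on). The function names and the `http://jm:8081` address are hypothetical placeholders.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def vertex_metric_urls(base_url: str, job_id: str, job_details: dict) -> list[str]:
    """Build one /metrics URL per vertex found in the job details JSON."""
    base = base_url.rstrip("/")
    return [
        f"{base}/jobs/{quote(job_id)}/vertices/{quote(vertex['id'])}/metrics"
        for vertex in job_details.get("vertices", [])
    ]


def dump_vertex_metrics(base_url: str, job_id: str) -> dict[str, list]:
    """Fetch the job details, then fetch the metrics listing of every vertex.

    Needs a reachable JobManager; base_url is a placeholder such as
    "http://jm:8081" (assumption, not taken from the thread).
    """
    base = base_url.rstrip("/")
    with urlopen(f"{base}/jobs/{quote(job_id)}", timeout=10) as resp:
        job_details = json.load(resp)
    dump = {}
    for url in vertex_metric_urls(base_url, job_id, job_details):
        with urlopen(url, timeout=10) as resp:
            dump[url] = json.load(resp)
    return dump
```

Keeping the URL construction separate from the HTTP calls makes it possible to check the generated paths without a running cluster.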

Re: DataDog and Flink

2021-03-24 Thread Vishal Santoshi
Yes, I will do that. Regarding the metrics dump through REST: it does provide the TM-specific metrics but refuses to do so for all jobs, vertices/operators, etc. Moreover, I am not sure the vertices (vertex_id) are readily accessible from the UI. curl http://[jm]/taskmanagers/[tm_id] curl http:

Re: DataDog and Flink

2021-03-24 Thread Arvid Heise
Hi Vishal, the REST API is the most direct way to get at all metrics, as Matthias pointed out. Additionally, you could add a JMX reporter and log in to the machines to check. But in general, I think you are on the right track. You need to reduce the metrics that are sent to DD by configuring th

Re: DataDog and Flink

2021-03-24 Thread Matthias Pohl
Hi Vishal, what about the TM metrics REST endpoint [1]? Is this something you could use to get all the metrics for a specific TaskManager? Or are you looking for something else? Best, Matthias [1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/rest_api.html#taskmanagers-metrics
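As a rough illustration of using that endpoint: called without parameters it lists the available metric names, and values are requested through a comma-separated `get` query parameter. The sketch below only builds the request URLs, in batches so that no single URL grows too long. The function name, the batch size, and the `http://jm:8081` address are hypothetical.

```python
from urllib.parse import quote


def tm_metric_value_urls(base_url, tm_id, metric_ids, batch_size=50):
    """Build /taskmanagers/<tm_id>/metrics?get=... URLs, batch_size names per URL."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    base = f"{base_url.rstrip('/')}/taskmanagers/{quote(tm_id, safe='')}/metrics"
    urls = []
    for start in range(0, len(metric_ids), batch_size):
        batch = metric_ids[start:start + batch_size]
        # Escape each name individually so the separating commas stay literal.
        names = ",".join(quote(name, safe="") for name in batch)
        urls.append(f"{base}?get={names}")
    return urls
```

Each returned URL can then be fetched with `curl` or `urllib.request.urlopen` against the JobManager.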

Re: DataDog and Flink

2021-03-23 Thread Vishal Santoshi
That said, is there a way to get a dump of all metrics exposed by a TM? I was searching for it, and I bet we could get it for a ServiceMonitor on k8s (scrape), but I am missing a way to hit a TM and dump all metrics that are pushed. Thanks and regards. On Tue, Mar 23, 2021 at 5:56 PM Vishal Santoshi wr

Re: DataDog and Flink

2021-03-23 Thread Vishal Santoshi
I guess there is a bigger issue here. We dropped the property to 500. We also realized that this failure happened on a TM that had one specific job running on it. What was good (but surprising) was that the exception was the more protocol-specific 413 (as in the chunk is greater than some size limi
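One possible reading of a 413 that persists after lowering a per-request metric count is that a count cap does not bound the serialized size of a chunk: a job with long metric names or many tags can still produce a large payload. The sketch below is purely illustrative and is not the reporter's code. It splits a list of already-built series objects by estimated JSON size; `max_bytes` is a hypothetical limit, and the estimate ignores the request wrapper and any compression.

```python
import json


def chunk_by_bytes(series, max_bytes):
    """Group series so each chunk's summed JSON size stays within max_bytes.

    A single item larger than max_bytes still gets a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for item in series:
        size = len(json.dumps(item).encode("utf-8"))
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Comparing the chunk sizes produced this way with those of a fixed count of 500 would show whether the job in question has unusually heavy series.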

Re: DataDog and Flink

2021-03-23 Thread Vishal Santoshi
If we look at this code, the metrics are divided into chunks of up to a max size and enqueued
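The chunk-and-enqueue behaviour described here can be pictured with a minimal sketch. This is an illustration of the general idea, not the reporter's actual implementation; the function name and the `max_per_request` parameter are hypothetical.

```python
from collections import deque


def enqueue_chunks(metrics, max_per_request):
    """Split metrics into chunks of at most max_per_request and queue them in order."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    queue = deque()
    for start in range(0, len(metrics), max_per_request):
        queue.append(metrics[start:start + max_per_request])
    return queue
```

With 1200 metrics and a cap of 500, this yields three queued chunks of 500, 500, and 200 entries, each of which would go out as its own request.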

Re: DataDog and Flink

2021-03-22 Thread Arvid Heise
Hi Vishal, I have no experience with the Flink+DataDog setup but worked a bit with DataDog before. I'd agree that the timeout does not seem like a rate limit. It would also be odd that the other TMs with a similar rate still pass. So I'd suspect network issues. Can you log into the TM's machine and try

DataDog and Flink

2021-03-20 Thread Vishal Santoshi
Hello folks, This is quite strange. We see a TM stop reporting metrics to DataDog. The logs from that specific TM show every DataDog dispatch timing out with java.net.SocketTimeoutException: timeout, and that seems to repeat on every dispatch to DataDog. It seems it is on a 10 se