Yep, there isn't a single endpoint that does the whole dump, but something like this works (dirty, but who cares :)). The vertex metrics are the most numerous anyway:

```
curl -s http://xxxx/jobs/[job_id] | jq -r '.vertices[].id' \
  | xargs -I {} curl http://xxxxxx/jobs/[job_id]/vertices/{}/metrics | jq
```
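A slightly more spelled-out sketch of the same idea, also covering the TaskManager side (untested; JM_URL and JOB_ID are placeholders, and without a ?get=<metric,...> query these endpoints only list the metric names, not their values):

```
JM_URL=http://xxxx      # placeholder: JobManager REST address
JOB_ID=[job_id]         # placeholder: job id from the UI or /jobs

# one file per TaskManager (names only; add ?get=... for values)
for tm in $(curl -s "$JM_URL/taskmanagers" | jq -r '.taskmanagers[].id'); do
  curl -s "$JM_URL/taskmanagers/$tm/metrics" | jq . > "tm-$tm.json"
done

# one file per vertex of the job
for v in $(curl -s "$JM_URL/jobs/$JOB_ID" | jq -r '.vertices[].id'); do
  curl -s "$JM_URL/jobs/$JOB_ID/vertices/$v/metrics" | jq . > "vertex-$v.json"
done
```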
On Wed, Mar 24, 2021 at 9:56 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> Yes, I will do that.
>
> Regarding the metrics dump through REST, it does provide for TM-specific metrics
> but not for all jobs and vertices/operators, etc. Moreover, I am not sure I have
> access to the vertices (vertex_id) readily from the UI.
>
> curl http://[jm]/taskmanagers/[tm_id]
> curl http://[jm]/taskmanagers/[tm_id]/metrics
>
> On Wed, Mar 24, 2021 at 4:24 AM Arvid Heise <ar...@apache.org> wrote:
>
>> Hi Vishal,
>>
>> The REST API is the most direct way to get at all metrics, as Matthias pointed
>> out. Additionally, you could add a JMX reporter and log in to the machines to check.
>>
>> But in general, I think you are on the right track. You need to reduce the
>> metrics that are sent to DD by configuring the scope / excluding variables.
>>
>> Furthermore, I think it would be a good idea to make the timeout configurable.
>> Could you open a ticket for that?
>>
>> Best,
>>
>> Arvid
>>
>> On Wed, Mar 24, 2021 at 9:02 AM Matthias Pohl <matth...@ververica.com> wrote:
>>
>>> Hi Vishal,
>>> What about the TM metrics REST endpoint [1]? Is this something you could use
>>> to get all the metrics for a specific TaskManager, or are you looking for
>>> something else?
>>>
>>> Best,
>>> Matthias
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/rest_api.html#taskmanagers-metrics
>>>
>>> On Tue, Mar 23, 2021 at 10:59 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> That said, is there a way to get a dump of all metrics exposed by a TM?
>>>> I was searching for it, and I bet we could get it via a ServiceMonitor on
>>>> k8s (scrape), but I am missing a way to hit a TM and dump all the metrics
>>>> that are pushed.
>>>>
>>>> Thanks and regards.
>>>>
>>>> On Tue, Mar 23, 2021 at 5:56 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>
>>>>> I guess there is a bigger issue here. We dropped the property to 500. We
>>>>> also realized that this failure happened on a TM that had one specific job
>>>>> running on it. What was good (but surprising) was that the exception became
>>>>> the more protocol-specific 413, as in the chunk is greater than some size
>>>>> limit DD has on a request:
>>>>>
>>>>> Failed to send request to Datadog (response was Response{protocol=h2,
>>>>> code=413, message=, url=https://app.datadoghq.com/api/v1/series?api_key=**********})
>>>>>
>>>>> which implies that the socket timeout was masking this issue. The 2000 was
>>>>> just a huge payload that DD was unable to parse in time (or was slow to
>>>>> upload, etc.). Now we could go lower, but that makes less sense. We could
>>>>> play with
>>>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/metrics.html#system-scope
>>>>> to reduce the size of the tags (or keys).
>>>>>
>>>>> On Tue, Mar 23, 2021 at 11:33 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> If we look at this code
>>>>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L159>,
>>>>>> the metrics are divided into chunks up to a max size and enqueued
>>>>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L110>.
>>>>>> The Request
>>>>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L75>
>>>>>> has a 3-second read/connect/write timeout, which IMHO should have been
>>>>>> configurable (or is it?). Since the number of metrics (all metrics) exposed
>>>>>> by a Flink cluster is pretty high (and so are the metric names along with
>>>>>> their tags), it may make sense to limit the number of metrics in a single
>>>>>> chunk (to ultimately limit the size of a single chunk). There is a
>>>>>> configuration that allows for reducing the metrics in a single chunk:
>>>>>>
>>>>>> metrics.reporter.dghttp.maxMetricsPerRequest: 2000
>>>>>>
>>>>>> We could decrease this to 1500 (1500 is pretty arbitrary, not based on any
>>>>>> empirical reasoning) and see if that stabilizes the dispatch. It is
>>>>>> inevitable that the number of requests will grow and we may hit the
>>>>>> throttle, but then we would get a known exception rather than the timeouts,
>>>>>> which are generally less intuitive.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> On Mon, Mar 22, 2021 at 10:37 AM Arvid Heise <ar...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Vishal,
>>>>>>>
>>>>>>> I have no experience with the Flink+DataDog setup but worked a bit with
>>>>>>> DataDog before. I'd agree that the timeout does not look like a rate
>>>>>>> limit. It would also be odd that the other TMs with a similar rate still
>>>>>>> pass. So I'd suspect network issues.
>>>>>>> Can you log into the TM's machine and try out manually how the system
>>>>>>> behaves?
>>>>>>>
>>>>>>> On Sat, Mar 20, 2021 at 1:44 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello folks,
>>>>>>>> This is quite strange. We see a TM stop reporting metrics to DataDog.
>>>>>>>> The logs from that specific TM show every DataDog dispatch timing out
>>>>>>>> with *java.net.SocketTimeoutException: timeout*, and that repeats on
>>>>>>>> every dispatch to DataDog. It seems to be on a 10-second cadence per
>>>>>>>> container. The TM remains humming, so it does not seem to be under
>>>>>>>> memory/CPU distress. And the exception is *not* transient. It just stops
>>>>>>>> dead and from there on times out.
>>>>>>>>
>>>>>>>> Looking at the SLA provided by DataDog, a throttling exception should
>>>>>>>> pretty much never surface as a SocketTimeout, unless of course the
>>>>>>>> reporting of that specific issue is off. This thus appears very much to
>>>>>>>> be a network issue, which is weird, as other TMs on the same network
>>>>>>>> just hum along, sending their metrics successfully. The other
>>>>>>>> possibility is that the sheer amount of metrics at the current volume
>>>>>>>> for this TM is prohibitive. That said, the exception is still not
>>>>>>>> helpful.
>>>>>>>>
>>>>>>>> Any ideas from folks who have used the DataDog reporter with Flink?
>>>>>>>> I guess even pointers to best practices would be a good start.
>>>>>>>>
>>>>>>>> Regards.
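Putting the two knobs discussed above together (smaller chunks plus shorter scope formats), a rough flink-conf.yaml sketch; the reporter and scope keys are from the Flink docs, but the concrete scope formats and the 500 are illustrative only, not a recommendation:

```
# Sketch only - values match what was tried in this thread, not tuned numbers.
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: <key>

# Smaller chunks keep each Datadog request under its payload limit (the 413 above).
metrics.reporter.dghttp.maxMetricsPerRequest: 500

# Shorter scope formats shrink the metric names; variables such as <host> and
# <tm_id> are still sent to Datadog as tags. Dropping too much from the scope
# can make metrics from different tasks collide, so treat this as a starting point.
metrics.scope.tm: taskmanager
metrics.scope.tm.job: taskmanager.<job_name>
metrics.scope.task: taskmanager.<job_name>.<task_name>.<subtask_index>
metrics.scope.operator: taskmanager.<job_name>.<operator_name>.<subtask_index>
```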