Hi Vishal,

what about the TM metrics REST endpoint [1]? Is this something you could use to get all the metrics for a specific TaskManager? Or are you looking for something else?
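For example, something along these lines ( a rough, untested sketch; it assumes the default REST port 8081 and you would substitute your own JobManager host and TaskManager id ) lists every metric a single TM exposes and lets you query values for specific ones:

  # list the TaskManagers and their ids
  curl http://<jobmanager-host>:8081/taskmanagers

  # list all metric names available on one TaskManager
  curl http://<jobmanager-host>:8081/taskmanagers/<tm-id>/metrics

  # fetch current values for selected metrics
  curl "http://<jobmanager-host>:8081/taskmanagers/<tm-id>/metrics?get=Status.JVM.Memory.Heap.Used,Status.JVM.CPU.Load"

Since this goes through Flink's own REST API, it is independent of whatever the DataDog reporter is doing.
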
Best,
Matthias

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/rest_api.html#taskmanagers-metrics

On Tue, Mar 23, 2021 at 10:59 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> That said, is there a way to get a dump of all metrics exposed by a TM? I
> was searching for it, and I bet we could get it via a ServiceMonitor on k8s
> ( scrape ), but I am missing a way to hit a TM and dump all the metrics
> that are pushed.
>
> Thanks and regards.
>
> On Tue, Mar 23, 2021 at 5:56 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> I guess there is a bigger issue here. We dropped the property to 500. We
>> also realized that this failure happened on a TM that had one specific job
>> running on it. What was good ( but surprising ) is that the exception was
>> the more protocol-specific 413 ( as in, the chunk is greater than some size
>> limit DD has on a request ):
>>
>> Failed to send request to Datadog (response was Response{protocol=h2,
>> code=413, message=, url=
>> https://app.datadoghq.com/api/v1/series?api_key=**********})
>>
>> which implies that the socket timeout was masking this issue. The 2000
>> was just a huge payload that DD was unable to parse in time ( or was slow
>> to upload, etc. ). Now we could go lower, but that makes less sense. We
>> could instead play with
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/metrics.html#system-scope
>> to reduce the size of the tags ( or keys ), e.g. something along the lines
>> of the sketch below.
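>>
>> A sketch of what I mean ( untested; the shortened formats are only
>> examples, they would have to match whatever our dashboards key off of, and
>> this assumes the reporter actually picks the configured scope formats up in
>> the metric identifiers it ships ):
>>
>>   # shorter formats than the defaults, which all start with <host>.taskmanager.<tm_id>...
>>   metrics.scope.tm: taskmanager.<tm_id>
>>   metrics.scope.tm.job: taskmanager.<tm_id>.<job_name>
>>   metrics.scope.task: <job_name>.<task_name>.<subtask_index>
>>   metrics.scope.operator: <job_name>.<operator_name>.<subtask_index>
>>
>> That would drop the hostname and most of the repetition from every
>> task/operator metric identifier, which should shave a fair amount off each
>> chunk.
>>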
>> On Tue, Mar 23, 2021 at 11:33 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>
>>> If we look at this
>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L159>
>>> code, the metrics are divided into chunks up to a max size and enqueued
>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L110>.
>>> The Request
>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L75>
>>> has a 3 second read/connect/write timeout, which IMHO should have been
>>> configurable ( or is it? ). Since the number of metrics ( all metrics )
>>> exposed by a Flink cluster is pretty high ( and the names of the metrics
>>> along with the tags are long ), it may make sense to limit the number of
>>> metrics in a single chunk ( to ultimately limit the size of a single
>>> chunk ). There is this configuration which allows for reducing the metrics
>>> in a single chunk:
>>>
>>> metrics.reporter.dghttp.maxMetricsPerRequest: 2000
>>>
>>> We could decrease this to 1500 ( 1500 is pretty arbitrary, not based on
>>> any empirical reasoning ) and see if that stabilizes the dispatch. It is
>>> inevitable that the number of requests will grow and we may hit the
>>> throttle, but then we would see that exception rather than the timeouts,
>>> which are generally less intuitive.
>>>
>>> Any thoughts?
>>>
>>> On Mon, Mar 22, 2021 at 10:37 AM Arvid Heise <ar...@apache.org> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> I have no experience in the Flink+DataDog setup but worked a bit with
>>>> DataDog before.
>>>> I'd agree that the timeout does not seem like a rate limit. It would
>>>> also be odd that the other TMs with a similar rate still pass. So I'd
>>>> suspect n/w issues.
>>>> Can you log into the TM's machine and try out manually how the system
>>>> behaves? ( A rough sketch of what I mean is below the quote. )
>>>>
>>>> On Sat, Mar 20, 2021 at 1:44 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>
>>>>> Hello folks,
>>>>> This is quite strange. We see a TM stop reporting metrics to DataDog.
>>>>> The logs from that specific TM show every DataDog dispatch timing out
>>>>> with *java.net.SocketTimeoutException: timeout*, and that seems to repeat
>>>>> for every dispatch to DataDog. It seems to be on a 10 second cadence per
>>>>> container. The TM keeps humming along, so it does not seem to be under
>>>>> memory/CPU distress. And the exception is *not* transient. It just stops
>>>>> dead and from there on times out.
>>>>>
>>>>> Looking at the SLA provided by DataDog, a throttling exception should
>>>>> pretty much not surface as a SocketTimeout, unless of course the reported
>>>>> exception is misleading. This thus appears very much to be a n/w issue,
>>>>> which is weird, as other TMs on the same n/w just hum along, sending their
>>>>> metrics successfully. The other possibility is simply that the amount of
>>>>> metrics, and the current volume for this TM, is prohibitive. That said,
>>>>> the exception is still not helpful.
>>>>>
>>>>> Any ideas from folks who have used the DataDog reporter with Flink? I
>>>>> guess even best practices would be a sufficient starting point.
>>>>>
>>>>> Regards.
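>>>>
>>>> For instance, something along these lines, run from the affected TM host
>>>> ( a rough, untested sketch; the api_key is a placeholder, and the URL is
>>>> the standard DataDog v1 series endpoint, so adjust it if you are on a
>>>> different DataDog site ), would show whether the DataDog intake itself is
>>>> reachable and how long a small request takes, independently of Flink:
>>>>
>>>>   # time a minimal series submission from the affected TM host
>>>>   NOW=$(date +%s)
>>>>   curl -s -o /dev/null -w 'http_code=%{http_code} total_time=%{time_total}\n' \
>>>>     -X POST "https://app.datadoghq.com/api/v1/series?api_key=<YOUR_API_KEY>" \
>>>>     -H "Content-Type: application/json" \
>>>>     -d "{\"series\":[{\"metric\":\"flink.connectivity.test\",\"points\":[[${NOW},1]],\"type\":\"gauge\",\"host\":\"$(hostname)\"}]}"
>>>>
>>>> If that is consistently slow or times out from this one machine but not
>>>> from the others, it points at the network path rather than the reporter.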