If we look at this code <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L159>, the metrics are divided into chunks up to a maximum size and then enqueued <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L110>. The request <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L75> has a 3-second read/connect/write timeout, which IMHO should have been configurable (or is it?).
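For context, here is a minimal sketch of that timeout setup (not the exact Flink source; the class and field names are mine): an OkHttp client built with hard-coded 3-second connect/write/read timeouts, which is the pattern the linked DatadogHttpClient code follows.

    // Sketch only: an OkHttp client with hard-coded 3-second timeouts, mirroring
    // the pattern in the linked DatadogHttpClient. Names here are illustrative.
    import java.util.concurrent.TimeUnit;
    import okhttp3.OkHttpClient;

    public class DatadogClientTimeoutSketch {
        private static final int TIMEOUT_SECONDS = 3;

        static OkHttpClient buildClient() {
            return new OkHttpClient.Builder()
                    .connectTimeout(TIMEOUT_SECONDS, TimeUnit.SECONDS)
                    .writeTimeout(TIMEOUT_SECONDS, TimeUnit.SECONDS)
                    .readTimeout(TIMEOUT_SECONDS, TimeUnit.SECONDS)
                    .build();
        }
    }

If those values are indeed compile-time constants, there is no obvious knob to loosen them, and any slow dispatch can only surface as a SocketTimeoutException.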
Since the number of metrics (all metrics) exposed by a Flink cluster is pretty high (and the metric names along with their tags add to the payload), it may make sense to limit the number of metrics in a single chunk, and thereby the size of a single chunk. There is a configuration which allows reducing the number of metrics in a single chunk:

metrics.reporter.dghttp.maxMetricsPerRequest: 2000

We could decrease this to 1500 (1500 is fairly arbitrary, not based on any empirical reasoning) and see if that stabilizes the dispatch; a sketch of the proposed flink-conf.yaml change is included below, after the quoted thread. It is inevitable that the number of requests will grow and we may hit the throttle, but then we would see an explicit rate-limit error rather than timeouts, which are generally less intuitive. Any thoughts?

On Mon, Mar 22, 2021 at 10:37 AM Arvid Heise <ar...@apache.org> wrote:

> Hi Vishal,
>
> I have no experience with the Flink + DataDog setup, but I have worked a bit
> with DataDog before.
> I'd agree that the timeout does not look like a rate limit. It would also
> be odd that the other TMs with a similar rate still pass, so I'd suspect
> network issues.
> Can you log into the TM's machine and check manually how the system
> behaves?
>
> On Sat, Mar 20, 2021 at 1:44 PM Vishal Santoshi <vishal.santo...@gmail.com>
> wrote:
>
>> Hello folks,
>>             This is quite strange. We see a TM stop reporting
>> metrics to DataDog. The logs from that specific TM show every DataDog
>> dispatch timing out with *java.net.SocketTimeoutException: timeout*, and
>> that repeats on every dispatch to DataDog, which seems to be on a
>> 10-second cadence per container. The TM remains humming, so it does not
>> seem to be under memory/CPU distress. And the exception is *not* transient:
>> it just stops dead and from there on times out.
>>
>> Looking at the SLA provided by DataDog, a throttling response should
>> pretty much never show up as a SocketTimeoutException, unless of course
>> the reporting of that specific condition is off. This thus appears very
>> much to be a network issue, which is weird because other TMs on the same
>> network just hum along, sending their metrics successfully. The other
>> possibility is simply the volume of metrics: the current volume for this
>> TM may be prohibitive. That said, the exception is still not helpful.
>>
>> Any ideas from folks who have used the DataDog reporter with Flink? Even
>> pointers to best practices would be a sufficient beginning.
>>
>> Regards.
>>
>>
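As referenced above, a sketch of the proposed flink-conf.yaml change. The first two lines are only context and assume the reporter is registered under the name "dghttp" as in the Flink docs, with a placeholder API key; the last line is the actual change, using the admittedly arbitrary value discussed above:

    metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
    metrics.reporter.dghttp.apikey: <your-datadog-api-key>
    # Cap the number of metrics per request so each chunk (and its HTTP payload) stays smaller.
    metrics.reporter.dghttp.maxMetricsPerRequest: 1500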