Hi Vishal,

I have no experience with the Flink + DataDog setup, but I have worked a bit with DataDog before. I agree that the timeout does not look like rate limiting. It would also be odd for the other TMs, which report at a similar rate, to keep passing. So I suspect a network issue. Can you log into the TM's machine and manually check how the connection to DataDog behaves from there?
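For the manual check, something like the following could be run on the TM's machine. It is only a minimal sketch: it assumes Python 3 is available on that host, and `probe` is a helper name I made up for illustration. It only times a plain TCP connect, so it tells you whether the host can reach the endpoint at all, not whether the reporter's HTTP request itself succeeds.

```python
import socket
import time


def probe(host, port=443, timeout=10.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed, error_text).

    `probe` is a hypothetical helper for this thread, not part of Flink
    or of any DataDog client.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the host and applies the timeout
        # to the connect attempt.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # OSError covers connect timeouts, refused connections,
        # and DNS resolution failures.
        elapsed = time.monotonic() - start
        return False, elapsed, f"{type(exc).__name__}: {exc}"
```

Calling `probe("app.datadoghq.com")` a few times from the failing TM's host and from a healthy one, then comparing the results, should show whether the problem is specific to that machine. The host name is my assumption; please use whatever endpoint your reporter is actually configured for.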
On Sat, Mar 20, 2021 at 1:44 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> Hello folks,
>
> This is quite strange. We see a TM stop reporting metrics to DataDog.
> The logs from that specific TM show every DataDog dispatch timing out
> with *java.net.SocketTimeoutException: timeout*, and that repeats on
> every dispatch to DataDog. It seems to be on a 10-second cadence per
> container. The TM remains humming, so it does not seem to be under
> memory/CPU distress. And the exception is *not* transient. It just
> stops dead, and from there on it is timeouts.
>
> Looking at the SLA provided by DataDog, any throttling exception should
> pretty much not be a SocketTimeoutException, unless of course the
> reporting of the specific issue is off. This thus appears very much to
> be a network issue, which is weird, as other TMs on the same network
> just hum along, sending their metrics successfully. The other
> possibility is that the number of metrics, and hence the current volume
> for this TM, is prohibitive. That said, the exception is still not
> helpful.
>
> Any ideas from folks who have used the DataDog reporter with Flink? I
> guess even best practices may be a sufficient beginning.
>
> Regards.