Hi Vishal,

I have no experience with the Flink + DataDog setup, but I have worked a bit
with DataDog before.
I agree that the timeout does not look like a rate limit. It would also be
odd for the other TMs with a similar reporting rate to still pass. So I'd
suspect network issues.
Can you log into the affected TM's machine and manually check how the system
behaves when it talks to DataDog?
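
If it helps, here is a minimal sketch of such a manual check, assuming
Python 3 is available on the TM's machine. It only times a plain TCP
connect, so it can tell you whether the machine reaches the endpoint at all,
not whether DataDog accepts the payload. The function name probe_tcp is made
up for illustration, and the host and port you pass in are assumptions on my
side; use whatever endpoint your reporter is actually configured for.

```python
import socket
import time


def probe_tcp(host, port, timeout=10.0):
    """Try one TCP connect and return (ok, elapsed_seconds, error).

    ok is True if the connection was established within `timeout`.
    error is None on success, otherwise the repr of the exception
    (DNS failure, refused connection, timeout, ...).
    """
    start = time.monotonic()
    try:
        # create_connection resolves the host and applies the timeout
        # to the connect attempt.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, None
    except OSError as exc:
        # socket.timeout, socket.gaierror and ConnectionRefusedError
        # are all subclasses of OSError.
        return False, time.monotonic() - start, repr(exc)
```

Calling something like probe_tcp("app.datadoghq.com", 443) a few times on the
failing TM's machine and on a healthy one (again, the host is my guess) should
show whether only that machine has trouble reaching DataDog.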

On Sat, Mar 20, 2021 at 1:44 PM Vishal Santoshi <vishal.santo...@gmail.com>
wrote:

> Hello folks,
> This is quite strange. We see a TM stop reporting metrics to DataDog. The
> logs from that specific TM show every DataDog dispatch timing out with
> *java.net.SocketTimeoutException: timeout*, and that repeats on every
> dispatch to DataDog. It seems to be on a 10 second cadence per container.
> The TM keeps humming, so it does not seem to be under memory/CPU distress.
> And the exception is *not* transient. It just stops dead and from there on
> every dispatch times out.
>
> Looking at the SLA provided by DataDog, a throttling error should pretty
> much never show up as a SocketTimeoutException, unless of course the
> reporting of the specific issue is off. This therefore looks very much like
> a network issue, which is weird because other TMs on the same network just
> hum along, sending their metrics successfully. The other possibility is
> that the amount of metrics, and so the current volume for this TM, is
> prohibitive. That said, the exception is still not helpful.
>
> Any ideas from folks who have used the DataDog reporter with Flink? I guess
> even best practices would be a sufficient beginning.
>
> Regards.
>
>
