Hi all, Recently, I tried to use the Datadog reporter to collect some user-defined metrics. Sometimes when reaching traffic peaks (which are also peaks for metrics), the HTTP client will throw the following exception:
``` [OkHttp https://app.datadoghq.com/...] WARN org.apache.flink.metrics.datadog.DatadogHttpClient - Failed sending request to Datadog java.net.SocketTimeoutException: timeout at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:593) at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:601) at okhttp3.internal.http2.Http2Stream.takeResponseHeaders(Http2Stream.java:146) at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:120) at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` I guess this may be caused by the rate limit of the Datadog server since too many HTTP requests look like a kind of "attack". The real problem is that after throwing the above exceptions, the JVM heap size of the taskmanager starts to increase and finally causes OOM. I'm curious if this may be caused by metrics accumulation, i.e., for some reason, the client can't reconnect to the Datadog server and send the metrics so that the metrics data is buffered in memory and causes OOM. I'm running Flink 1.11.2 on EMR-6.2.0 with flink-metrics-datadog-1.11.2.jar. Thanks, Xingcan