Datadog reporter timeout & OOM issue

Xingcan Cui Tue, 26 Jan 2021 20:53:08 -0800

Hi all,

Recently, I tried to use the Datadog reporter to collect some user-defined
metrics. At traffic peaks (which are also peaks for metrics), the HTTP
client sometimes throws the following exception:


```
[OkHttp https://app.datadoghq.com/...] WARN org.apache.flink.metrics.datadog.DatadogHttpClient  - Failed sending request to Datadog
java.net.SocketTimeoutException: timeout
    at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:593)
    at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:601)
    at okhttp3.internal.http2.Http2Stream.takeResponseHeaders(Http2Stream.java:146)
    at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:120)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

My guess is that the Datadog server is rate-limiting us, since a burst of
HTTP requests can look like a kind of "attack". The real problem is what
happens next: after these exceptions are thrown, the JVM heap usage of the
TaskManager keeps growing and eventually causes an OOM. Could this be
caused by metrics accumulation? That is, if for some reason the client
can't reconnect to the Datadog server and send the metrics, is the metrics
data buffered in memory until the heap is exhausted?
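
To make the suspicion concrete, here is a minimal sketch of the mechanism I
have in mind. It is not Flink or OkHttp code, and I have not verified that
either of them behaves this way; the class `PendingReportsSketch`, the
method `simulate`, and all the numbers are made up for illustration. A
reporter enqueues one payload per tick, while a slow (timing-out) sender
completes only one payload every few ticks. With no cap on the pending
queue the backlog grows linearly with time; with a cap that drops the
oldest payload it stays bounded.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical illustration only: models a producer that outpaces a slow sender.
public class PendingReportsSketch {

    /**
     * Enqueues one 1 KiB payload per tick; the sender finishes one payload
     * every ticksPerSend ticks. A capacity <= 0 means the queue is unbounded.
     * Returns the number of payloads still pending after the last tick.
     */
    public static int simulate(int ticks, int ticksPerSend, int capacity) {
        Queue<byte[]> pending = new ArrayDeque<>();
        for (int t = 1; t <= ticks; t++) {
            // Bounded variant: make room by dropping the oldest payload.
            if (capacity > 0 && pending.size() >= capacity) {
                pending.poll();
            }
            pending.add(new byte[1024]);

            // The slow sender completes a request only on some ticks.
            if (t % ticksPerSend == 0) {
                pending.poll();
            }
        }
        return pending.size();
    }

    public static void main(String[] args) {
        System.out.println("unbounded backlog: " + simulate(1000, 10, 0));
        System.out.println("bounded backlog:   " + simulate(1000, 10, 50));
    }
}
```

If the reporter's HTTP client queues requests like the unbounded case above,
that would match the heap growth I see after the timeouts start.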

I'm running Flink 1.11.2 on EMR-6.2.0 with flink-metrics-datadog-1.11.2.jar.

Thanks,
Xingcan
