Yes, I can see how the memory issue could occur.
However, it should be limited to buffering 64 requests; this is the
default limit that okhttp imposes on concurrent calls.
Maybe lowering this value would already do the trick.
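
For reference, that limit lives on okhttp's Dispatcher. The stock
DatadogHttpClient doesn't expose a setting for it as far as I know, so the
following is only a minimal sketch of how a patched reporter could lower it
when building the OkHttpClient itself (the values 8 and the 3-second timeouts
are purely illustrative):

```java
import java.util.concurrent.TimeUnit;

import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

public class LimitedClientSketch {

    public static OkHttpClient build() {
        // okhttp's Dispatcher defaults to 64 concurrent calls in total
        // (and 5 per host); lowering both bounds the number of in-flight
        // metric reports that can pile up at once.
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequests(8);        // illustrative value, not a recommendation
        dispatcher.setMaxRequestsPerHost(8); // Datadog is a single host, so align the two

        return new OkHttpClient.Builder()
                .dispatcher(dispatcher)
                .connectTimeout(3, TimeUnit.SECONDS)
                .writeTimeout(3, TimeUnit.SECONDS)
                .readTimeout(3, TimeUnit.SECONDS)
                .build();
    }
}
```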
On 1/27/2021 5:52 AM, Xingcan Cui wrote:
Hi all,
Recently, I tried to use the Datadog reporter to collect some
user-defined metrics. Sometimes, when traffic reaches a peak (which is
also a peak for metrics), the HTTP client throws the following
exception:
```
[OkHttp https://app.datadoghq.com/...] WARN org.apache.flink.metrics.datadog.DatadogHttpClient - Failed sending request to Datadog
java.net.SocketTimeoutException: timeout
    at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:593)
    at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:601)
    at okhttp3.internal.http2.Http2Stream.takeResponseHeaders(Http2Stream.java:146)
    at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:120)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
I guess this may be caused by rate limiting on the Datadog server,
since too many HTTP requests can look like a kind of "attack". The real
problem is that after these exceptions are thrown, the JVM heap usage
of the taskmanager starts to increase and eventually causes an OOM. I'm
curious whether this is caused by metrics accumulation, i.e., for some
reason the client can't reconnect to the Datadog server and send the
metrics, so the metrics data is buffered in memory until it causes the OOM.
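
If that hypothesis is right, okhttp's dispatcher queue depth should grow
whenever sends fail. Just as an illustration of what I would like to check
(the stock reporter does not expose its OkHttpClient, so this assumes access
to it, e.g. in a patched build):

```java
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

public class DispatcherProbe {

    // Logs how many async calls are currently executing vs. still queued.
    // queuedCallsCount() growing without bound while runningCallsCount()
    // stays pinned at the concurrency limit would match the
    // "metrics pile up in memory" theory.
    public static void logQueueDepth(OkHttpClient client) {
        Dispatcher dispatcher = client.dispatcher();
        System.out.printf("running=%d queued=%d%n",
                dispatcher.runningCallsCount(),
                dispatcher.queuedCallsCount());
    }
}
```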
I'm running Flink 1.11.2 on EMR-6.2.0 with
flink-metrics-datadog-1.11.2.jar.
Thanks,
Xingcan