Hi Juha and Chesnay, I do appreciate your prompt responses! I'll also continue to investigate this issue.
Best,
Xingcan

On Wed, Jan 27, 2021, 04:32 Chesnay Schepler <ches...@apache.org> wrote:

> (setting this field is currently not possible from a Flink user
> perspective; it is something I will investigate)
>
> On 1/27/2021 10:30 AM, Chesnay Schepler wrote:
>
> Yes, I could see how the memory issue can occur.
>
> However, it should be limited to buffering 64 requests; this is the
> default limit that okhttp imposes on concurrent calls.
> Maybe lowering this value already does the trick.
>
> On 1/27/2021 5:52 AM, Xingcan Cui wrote:
>
> Hi all,
>
> Recently, I tried to use the Datadog reporter to collect some user-defined
> metrics. Sometimes when we hit traffic peaks (which are also peaks for the
> metrics), the HTTP client throws the following exception:
>
> ```
> [OkHttp https://app.datadoghq.com/...] WARN org.apache.flink.metrics.datadog.DatadogHttpClient - Failed sending request to Datadog
> java.net.SocketTimeoutException: timeout
>     at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:593)
>     at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:601)
>     at okhttp3.internal.http2.Http2Stream.takeResponseHeaders(Http2Stream.java:146)
>     at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:120)
>     at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
>     at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
>     at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> ```
>
> I guess this may be caused by rate limiting on the Datadog server side,
> since a burst of HTTP requests can look like a kind of "attack". The real
> problem is that after these exceptions are thrown, the JVM heap of the
> taskmanager starts to grow and eventually causes an OOM. I'm curious
> whether this is caused by metrics accumulation, i.e., for some reason the
> client can't reconnect to the Datadog server to send the metrics, so the
> metrics data is buffered in memory until it causes the OOM.
>
> I'm running Flink 1.11.2 on EMR-6.2.0 with flink-metrics-datadog-1.11.2.jar.
>
> Thanks,
> Xingcan
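P.S. For anyone following along: the concurrent-call limit Chesnay mentions is okhttp's Dispatcher setting. The sketch below is only an illustration of that okhttp API under the assumption that one could hand the Datadog reporter a customized client; as Chesnay notes, the Flink reporter does not currently expose this, so nothing here corresponds to an existing reporter option.

```
import java.util.concurrent.TimeUnit;

import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

// Minimal sketch: lowering okhttp's async-call limits so fewer unsent
// requests (and their metric payloads) can queue up in memory while the
// Datadog endpoint is timing out.
public class LimitedOkHttpClientSketch {

    public static OkHttpClient build() {
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequests(16);        // okhttp default is 64 concurrent async calls
        dispatcher.setMaxRequestsPerHost(16); // okhttp default is 5 per host

        return new OkHttpClient.Builder()
                .connectTimeout(3, TimeUnit.SECONDS)
                .writeTimeout(3, TimeUnit.SECONDS)
                .readTimeout(3, TimeUnit.SECONDS)
                .dispatcher(dispatcher)
                .build();
    }

    public static void main(String[] args) {
        // Calls enqueued beyond the dispatcher limits wait in its internal queue,
        // which is the buffering the thread above is discussing.
        OkHttpClient client = build();
        System.out.println("Client built with lowered dispatcher limits: " + client);
    }
}
```

Whether (and how) such a knob could be set from the Flink side is exactly what Chesnay said he would investigate.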