Hi Vishal,

The REST API is the most direct way to get at all the metrics, as Matthias pointed out. Additionally, you could also add a JMX reporter and log into the machines to check.
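For example, here is a rough, untested sketch for the JMX route (the port range is a placeholder, pick whatever is free on your hosts):

# flink-conf.yaml - hypothetical values, adjust to your setup
metrics.reporter.jmx.factory.class: org.apache.flink.metrics.jmx.JMXReporterFactory
# a port range is handy when several TMs share a host
metrics.reporter.jmx.port: 8789-8799

With that in place you can attach JConsole/VisualVM on the machine and browse every metric the TM exposes. For the REST route, GET /taskmanagers/<taskmanagerid>/metrics lists the metric names for one TM, and appending ?get=<name1>,<name2> returns their values.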
But in general, I think you are on the right track. You need to reduce the metrics that are sent to DD by configuring the scope / excluding variables (a rough config sketch is at the bottom of this mail, below the quoted thread). Furthermore, I think it would be a good idea to make the timeout configurable. Could you open a ticket for that?

Best,
Arvid

On Wed, Mar 24, 2021 at 9:02 AM Matthias Pohl <matth...@ververica.com> wrote:

> Hi Vishal,
> what about the TM metrics' REST endpoint [1]? Is this something you could use to get all the metrics for a specific TaskManager? Or are you looking for something else?
>
> Best,
> Matthias
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/rest_api.html#taskmanagers-metrics
>
> On Tue, Mar 23, 2021 at 10:59 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> That said, is there a way to get a dump of all metrics exposed by a TM? I was searching for it, and I bet we could get it via a ServiceMonitor on k8s (scrape), but I am missing a way to hit a TM and dump all the metrics that are pushed.
>>
>> Thanks and regards.
>>
>> On Tue, Mar 23, 2021 at 5:56 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>
>>> I guess there is a bigger issue here. We dropped the property to 500. We also realized that this failure happened on a TM that had one specific job running on it. What was good (but surprising) was that the exception was the more protocol-specific 413 (as in, the chunk is greater than some size limit DD has on a request):
>>>
>>> Failed to send request to Datadog (response was Response{protocol=h2, code=413, message=, url=https://app.datadoghq.com/api/v1/series?api_key=**********})
>>>
>>> which implies that the socket timeout was masking this issue. The 2000 was just a huge payload that DD was unable to parse in time (or was slow to upload, etc.). Now we could go lower, but that makes less sense. We could play with https://ci.apache.org/projects/flink/flink-docs-stable/ops/metrics.html#system-scope to reduce the size of the tags (or keys).
>>>
>>> On Tue, Mar 23, 2021 at 11:33 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> If we look at this code
>>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L159>,
>>>> the metrics are divided into chunks up to a max size and enqueued
>>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L110>.
>>>> The Request
>>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L75>
>>>> has a 3-second read/connect/write timeout, which IMHO should be configurable (or is it?). Since the number of metrics (all metrics) exposed by a Flink cluster is pretty high (and the names of the metrics, along with the tags, are long), it may make sense to limit the number of metrics in a single chunk (to ultimately limit the size of a single chunk).
>>>> There is this configuration, which allows for reducing the number of metrics in a single chunk:
>>>>
>>>> metrics.reporter.dghttp.maxMetricsPerRequest: 2000
>>>>
>>>> We could decrease this to 1500 (1500 is pretty much arbitrary, not based on any empirical reasoning) and see if that stabilizes the dispatch. It is inevitable that the number of requests will grow and we may hit the throttle, but then we would know the exception, rather than the timeouts, which are generally less intuitive.
>>>>
>>>> Any thoughts?
>>>>
>>>> On Mon, Mar 22, 2021 at 10:37 AM Arvid Heise <ar...@apache.org> wrote:
>>>>
>>>>> Hi Vishal,
>>>>>
>>>>> I have no experience with the Flink + DataDog setup, but I have worked a bit with DataDog before.
>>>>> I'd agree that the timeout does not seem like a rate limit. It would also be odd that the other TMs with a similar rate still pass. So I'd suspect n/w issues.
>>>>> Can you log into the TM's machine and try out manually how the system behaves?
>>>>>
>>>>> On Sat, Mar 20, 2021 at 1:44 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> Hello folks,
>>>>>> This is quite strange. We see a TM stop reporting metrics to DataDog. The logs from that specific TM show every DataDog dispatch timing out with *java.net.SocketTimeoutException: timeout*, and that repeats on every dispatch to DataDog. It seems to be on a 10-second cadence per container. The TM remains humming, so it does not seem to be under memory/CPU distress. And the exception is *not* transient. It just stops dead and from there on times out.
>>>>>>
>>>>>> Looking at the SLA provided by DataDog, a throttling issue should pretty much not surface as a SocketTimeout, unless of course the reporting of that specific issue is off. This thus appears very much to be a n/w issue, which is weird, as other TMs on the same n/w just hum along, sending their metrics successfully. The other possibility could be simply that the amount of metrics, at the current volume for the TM, is prohibitive. That said, the exception is still not helpful.
>>>>>>
>>>>>> Any ideas from folks who have used the DataDog reporter with Flink? I guess even best practices would be a sufficient beginning.
>>>>>>
>>>>>> Regards.
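PS: here is the rough config sketch mentioned at the top - untested, the values are placeholders, so please verify the effect on your side before rolling it out:

# flink-conf.yaml - hypothetical values
# Smaller chunks keep each request under DD's payload limit (the 413s above),
# at the cost of more requests per report interval.
metrics.reporter.dghttp.maxMetricsPerRequest: 1500

# Example of a shortened scope format (the default also contains <job_name>);
# shorter identifiers/tags mean smaller payloads, but beware of metric name
# collisions if one TM runs tasks of several jobs, and I have not verified how
# much of the scope format the tag-based Datadog reporter actually picks up.
metrics.scope.task: <host>.taskmanager.<tm_id>.<task_name>.<subtask_index>

The defaults and the available variables are in the system-scope docs Vishal linked above.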