Yep, there isn't a single endpoint that does the whole dump, but something like this works (dirty, but who cares :)). The vertex metrics are the most numerous anyway:

```
curl -s http://xxxx/jobs/[job_id] | jq -r '.vertices[].id' \
  | xargs -I {} curl http://xxxxxx/jobs/[job_id]/vertices/{}/metrics | jq
```
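A slightly more spelled-out sketch of the same idea, also covering the TaskManager side (untested; JM_URL and JOB_ID are placeholders, and without a ?get=<metric,...> query these endpoints only list the metric names, not their values):

```
JM_URL=http://xxxx      # placeholder: JobManager REST address
JOB_ID=[job_id]         # placeholder: job id from the UI or /jobs

# one file per TaskManager (names only; add ?get=... for values)
for tm in $(curl -s "$JM_URL/taskmanagers" | jq -r '.taskmanagers[].id'); do
  curl -s "$JM_URL/taskmanagers/$tm/metrics" | jq . > "tm-$tm.json"
done

# one file per vertex of the job
for v in $(curl -s "$JM_URL/jobs/$JOB_ID" | jq -r '.vertices[].id'); do
  curl -s "$JM_URL/jobs/$JOB_ID/vertices/$v/metrics" | jq . > "vertex-$v.json"
done
```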
On Wed, Mar 24, 2021 at 9:56 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> Yes, I will do that.
>
> Regarding the metrics dump through REST, it does provide for TM-specific metrics
> but not for all jobs and vertices/operators, etc. Moreover, I am not sure I have
> access to the vertices (vertex_id) readily from the UI.
>
> curl http://[jm]/taskmanagers/[tm_id]
> curl http://[jm]/taskmanagers/[tm_id]/metrics
>
> On Wed, Mar 24, 2021 at 4:24 AM Arvid Heise <ar...@apache.org> wrote:
>
>> Hi Vishal,
>>
>> The REST API is the most direct way to get at all metrics, as Matthias pointed
>> out. Additionally, you could add a JMX reporter and log in to the machines to check.
>>
>> But in general, I think you are on the right track. You need to reduce the
>> metrics that are sent to DD by configuring the scope / excluding variables.
>>
>> Furthermore, I think it would be a good idea to make the timeout configurable.
>> Could you open a ticket for that?
>>
>> Best,
>>
>> Arvid
>>
>> On Wed, Mar 24, 2021 at 9:02 AM Matthias Pohl <matth...@ververica.com> wrote:
>>
>>> Hi Vishal,
>>> What about the TM metrics REST endpoint [1]? Is this something you could use
>>> to get all the metrics for a specific TaskManager, or are you looking for
>>> something else?
>>>
>>> Best,
>>> Matthias
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/rest_api.html#taskmanagers-metrics
>>>
>>> On Tue, Mar 23, 2021 at 10:59 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> That said, is there a way to get a dump of all metrics exposed by a TM?
>>>> I was searching for it, and I bet we could get it via a ServiceMonitor on
>>>> k8s (scrape), but I am missing a way to hit a TM and dump all the metrics
>>>> that are pushed.
>>>>
>>>> Thanks and regards.
>>>>
>>>> On Tue, Mar 23, 2021 at 5:56 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>
>>>>> I guess there is a bigger issue here. We dropped the property to 500. We
>>>>> also realized that this failure happened on a TM that had one specific job
>>>>> running on it. What was good (but surprising) was that the exception became
>>>>> the more protocol-specific 413, as in the chunk is greater than some size
>>>>> limit DD has on a request:
>>>>>
>>>>> Failed to send request to Datadog (response was Response{protocol=h2,
>>>>> code=413, message=, url=https://app.datadoghq.com/api/v1/series?api_key=**********})
>>>>>
>>>>> which implies that the socket timeout was masking this issue. The 2000 was
>>>>> just a huge payload that DD was unable to parse in time (or was slow to
>>>>> upload, etc.). Now we could go lower, but that makes less sense. We could
>>>>> play with
>>>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/metrics.html#system-scope
>>>>> to reduce the size of the tags (or keys).
>>>>>
>>>>> On Tue, Mar 23, 2021 at 11:33 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> If we look at this code
>>>>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L159>,
>>>>>> the metrics are divided into chunks up to a max size and enqueued
>>>>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L110>.
>>>>>> The Request
>>>>>> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L75>
>>>>>> has a 3-second read/connect/write timeout, which IMHO should have been
>>>>>> configurable (or is it?). Since the number of metrics (all metrics) exposed
>>>>>> by a Flink cluster is pretty high (and so are the metric names along with
>>>>>> their tags), it may make sense to limit the number of metrics in a single
>>>>>> chunk (to ultimately limit the size of a single chunk). There is a
>>>>>> configuration that allows for reducing the metrics in a single chunk:
>>>>>>
>>>>>> metrics.reporter.dghttp.maxMetricsPerRequest: 2000
>>>>>>
>>>>>> We could decrease this to 1500 (1500 is pretty arbitrary, not based on any
>>>>>> empirical reasoning) and see if that stabilizes the dispatch. It is
>>>>>> inevitable that the number of requests will grow and we may hit the
>>>>>> throttle, but then we would get a known exception rather than the timeouts,
>>>>>> which are generally less intuitive.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> On Mon, Mar 22, 2021 at 10:37 AM Arvid Heise <ar...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Vishal,
>>>>>>>
>>>>>>> I have no experience with the Flink+DataDog setup but worked a bit with
>>>>>>> DataDog before. I'd agree that the timeout does not look like a rate
>>>>>>> limit. It would also be odd that the other TMs with a similar rate still
>>>>>>> pass. So I'd suspect network issues.
>>>>>>> Can you log into the TM's machine and try out manually how the system
>>>>>>> behaves?
>>>>>>>
>>>>>>> On Sat, Mar 20, 2021 at 1:44 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello folks,
>>>>>>>> This is quite strange. We see a TM stop reporting metrics to DataDog.
>>>>>>>> The logs from that specific TM show every DataDog dispatch timing out
>>>>>>>> with *java.net.SocketTimeoutException: timeout*, and that repeats on
>>>>>>>> every dispatch to DataDog. It seems to be on a 10-second cadence per
>>>>>>>> container. The TM remains humming, so it does not seem to be under
>>>>>>>> memory/CPU distress. And the exception is *not* transient. It just stops
>>>>>>>> dead and from there on times out.
>>>>>>>>
>>>>>>>> Looking at the SLA provided by DataDog, a throttling exception should
>>>>>>>> pretty much never surface as a SocketTimeout, unless of course the
>>>>>>>> reporting of that specific issue is off. This thus appears very much to
>>>>>>>> be a network issue, which is weird, as other TMs on the same network
>>>>>>>> just hum along, sending their metrics successfully. The other
>>>>>>>> possibility is that the sheer amount of metrics at the current volume
>>>>>>>> for this TM is prohibitive. That said, the exception is still not
>>>>>>>> helpful.
>>>>>>>>
>>>>>>>> Any ideas from folks who have used the DataDog reporter with Flink?
>>>>>>>> I guess even pointers to best practices would be a good start.
>>>>>>>>
>>>>>>>> Regards.
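Putting the two knobs discussed above together (smaller chunks plus shorter scope formats), a rough flink-conf.yaml sketch; the reporter and scope keys are from the Flink docs, but the concrete scope formats and the 500 are illustrative only, not a recommendation:

```
# Sketch only - values match what was tried in this thread, not tuned numbers.
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: <key>

# Smaller chunks keep each Datadog request under its payload limit (the 413 above).
metrics.reporter.dghttp.maxMetricsPerRequest: 500

# Shorter scope formats shrink the metric names; variables such as <host> and
# <tm_id> are still sent to Datadog as tags. Dropping too much from the scope
# can make metrics from different tasks collide, so treat this as a starting point.
metrics.scope.tm: taskmanager
metrics.scope.tm.job: taskmanager.<job_name>
metrics.scope.task: taskmanager.<job_name>.<task_name>.<subtask_index>
metrics.scope.operator: taskmanager.<job_name>.<operator_name>.<subtask_index>
```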