Robert Batts created FLINK-7200:
-----------------------------------

             Summary: Make metrics more Datadog friendly
                 Key: FLINK-7200
                 URL: https://issues.apache.org/jira/browse/FLINK-7200
             Project: Flink
          Issue Type: Improvement
          Components: Metrics
    Affects Versions: 1.3.1
            Reporter: Robert Batts
            Priority: Minor


The current output of the Datadog Reporter is a little unfriendly to Datadog 
from a metric-naming perspective. Take for example the metric reported by the 
Datadog Kafka integration:

kafka.consumer_lag=0000 [topic:xxxx, consumer_group:yyyy, partition:0000]

Through the use of tags (in this case topic, consumer_group, and partition) you 
can create graphs in Datadog filtered to a specific topic and consumer_group 
and then averaged on each partition. This allows you to visualize something 
like a heatmap for lag on each partition for a consumer.

So what am I suggesting for Flink? I think the tags for Datadog are already in 
a great place. Tags like job_id and subtask_id would be great for filtering 
and grouping. The metric name, however, is too specific to a taskmanager and 
subtask. Currently, the metrics look something like this:

flink_w04.taskmanager.4f378aff5730.TwitterExample.ExtractHashtags.7.numRecordsOut
{host}.taskmanager.{tm_id}.{job_name}.{operator_name}.{subtask_index}.{metric_name}

What I am suggesting is something more like this:

taskmanager.TwitterExample.ExtractHashtags.numRecordsOut
taskmanager.{job_name}.{operator_name}.{metric_name}
(or even taskmanager.{metric_name}, but that would be a lot of tags on a single 
metric)
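
To make the proposed renaming concrete, here is a minimal sketch in plain Java 
that splits a fully scoped name into the simplified name plus tags. It is not 
Flink or Datadog reporter code: the class MetricNameSketch, its TaggedMetric 
type, and the simplify method are hypothetical names, and the sketch assumes 
that no component of the name (host, job name, operator name) contains a dot.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Flink or its Datadog reporter.
public final class MetricNameSketch {

    /** Simplified metric name plus the tags that carry the removed identifiers. */
    public static final class TaggedMetric {
        public final String name;
        public final Map<String, String> tags;

        TaggedMetric(String name, Map<String, String> tags) {
            this.name = name;
            this.tags = tags;
        }
    }

    /**
     * Expects the layout
     * {host}.taskmanager.{tm_id}.{job_name}.{operator_name}.{subtask_index}.{metric_name}
     * and assumes that no component contains a dot.
     */
    public static TaggedMetric simplify(String fullName) {
        String[] parts = fullName.split("\\.");
        if (parts.length != 7 || !"taskmanager".equals(parts[1])) {
            throw new IllegalArgumentException("Unexpected metric name layout: " + fullName);
        }
        // Identifiers dropped from the name are kept as tags instead.
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("host", parts[0]);
        tags.put("tm_id", parts[2]);
        tags.put("subtask_index", parts[5]);
        // taskmanager.{job_name}.{operator_name}.{metric_name}
        String name = String.join(".", parts[1], parts[3], parts[4], parts[6]);
        return new TaggedMetric(name, tags);
    }

    public static void main(String[] args) {
        TaggedMetric m = simplify(
            "flink_w04.taskmanager.4f378aff5730.TwitterExample.ExtractHashtags.7.numRecordsOut");
        System.out.println(m.name + " " + m.tags);
    }
}
```

For the example metric above, the name becomes 
taskmanager.TwitterExample.ExtractHashtags.numRecordsOut, while host, tm_id, and 
subtask_index survive as tags rather than as parts of the name.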

By doing this, someone could graph numRecordsOut for an entire task with a 
single metric in Datadog, rather than combining the per-subtask_index metrics 
whose names embed a tm_id (which could change if a taskmanager dropped out of 
the cluster). Additionally, given the current set of tags being output to 
Datadog, a lot of grouping and filtering would become available if everything 
were reported on a simplified metric.
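
As a rough illustration of the grouping this would enable, the sketch below 
sums one simplified metric per tag value, roughly what a "sum by tag" graph 
does. Everything in it is hypothetical: the class TagGroupingSketch, the Point 
type, the helper methods, and the sample values (tm_a, tm_b, tm_c and the 
counts) are made up for the example and are not Flink or Datadog APIs or real 
measurements.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not Flink or Datadog code.
public final class TagGroupingSketch {

    public static final String NAME =
        "taskmanager.TwitterExample.ExtractHashtags.numRecordsOut";

    /** One reported data point: a metric name, its tags, and a value. */
    public static final class Point {
        public final String name;
        public final Map<String, String> tags;
        public final long value;

        public Point(String name, Map<String, String> tags, long value) {
            this.name = name;
            this.tags = tags;
            this.value = value;
        }
    }

    /** Builds a point tagged with tm_id and subtask_index. */
    public static Point point(String name, String tmId, String subtaskIndex, long value) {
        Map<String, String> tags = new HashMap<>();
        tags.put("tm_id", tmId);
        tags.put("subtask_index", subtaskIndex);
        return new Point(name, tags, value);
    }

    /** Sums the values of one metric, grouped by the value of the given tag. */
    public static Map<String, Long> sumByTag(List<Point> points, String metricName, String tagKey) {
        Map<String, Long> sums = new TreeMap<>();
        for (Point p : points) {
            if (!p.name.equals(metricName)) {
                continue; // a different metric; ignore it
            }
            String tagValue = p.tags.getOrDefault(tagKey, "unknown");
            sums.merge(tagValue, p.value, Long::sum);
        }
        return sums;
    }

    /** Made-up sample: subtask 1 reports from two taskmanagers after a reschedule. */
    public static List<Point> samplePoints() {
        List<Point> points = new ArrayList<>();
        points.add(point(NAME, "tm_a", "0", 10));
        points.add(point(NAME, "tm_b", "1", 20));
        points.add(point(NAME, "tm_c", "1", 5));
        return points;
    }

    public static void main(String[] args) {
        System.out.println(sumByTag(samplePoints(), NAME, "subtask_index"));
        System.out.println(sumByTag(samplePoints(), NAME, "tm_id"));
    }
}
```

Because the metric name no longer contains the tm_id, the points from tm_b and 
tm_c fall under the same subtask_index group without any change to the query.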



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
