Hi everyone!

We are currently working on building a unified monitoring/alerting solution
for Spark and would like to rely on Spark's own metrics to avoid divergence
from the upstream. One of the challenges is to support metrics coming from
multiple Spark applications running on a cluster: scheduled jobs,
long-running streaming applications etc.

Original problem:
Spark assigns metrics names using *spark.app.id <http://spark.app.id>*
and *spark.executor.id
<http://spark.executor.id>* as a part of them. Thus the number of metrics
is continuously growing because those IDs are unique between executions
whereas the metrics themselves report the same thing. Another issue which
arises here is how to use constantly changing metric names in dashboards.

For example, *jvm_heap_used* reported by all Spark instances (components):
- <spark.app.id>_driver_jvm_heap_used (Driver)
- <spark.app.id>_<spark.executor.id>_jvm_heap_used (Executors)

While *spark.app.id <http://spark.app.id>* can be overridden with
*spark.metrics.namespace*, there's no such an option for *spark.executor.id
<http://spark.executor.id>* which makes it impossible to build a reusable
dashboard because (given the uniqueness of IDs) differently named metrics
are emitted for each execution.

One of the possible solutions would be to make executor metrics names
follow the driver's metrics name pattern, e.g.:
- <spark.app.id>_driver_jvm_heap_used (Driver)
- <spark.app.id>_executor_jvm_heap_used (Executors)

and distinguish executors based on tags (tags should be configured in
metric reporters in this case). Not sure if this could potentially break
Driver UI though.

I'd really appreciate any feedback on this issue and would be happy to
create a Jira issue/PR if this change looks sane for the community.

Thanks in advance.

-- 
*Anton Kirillov*
Senior Software Engineer, Mesosphere

Reply via email to