Hi everyone! We are currently building a unified monitoring/alerting solution for Spark and would like to rely on Spark's own metrics to avoid divergence from upstream. One of the challenges is supporting metrics coming from multiple Spark applications running on a cluster: scheduled jobs, long-running streaming applications, etc.
Original problem: Spark uses *spark.app.id* and *spark.executor.id* as part of its metric names. Because those IDs are unique per execution, the number of distinct metrics grows continuously even though the metrics themselves report the same thing. A related issue is how to use constantly changing metric names in dashboards. For example, *jvm_heap_used* is reported by all Spark instances (components) as:

- <spark.app.id>_driver_jvm_heap_used (Driver)
- <spark.app.id>_<spark.executor.id>_jvm_heap_used (Executors)

While *spark.app.id* can be overridden with *spark.metrics.namespace*, there is no such option for *spark.executor.id*, which makes it impossible to build a reusable dashboard because (given the uniqueness of the IDs) differently named metrics are emitted for each execution.

One possible solution would be to make executor metric names follow the driver's naming pattern, e.g.:

- <spark.app.id>_driver_jvm_heap_used (Driver)
- <spark.app.id>_executor_jvm_heap_used (Executors)

and distinguish executors by tags (tags would need to be configured in the metric reporters in this case). I'm not sure whether this could break the Driver UI, though.

I'd really appreciate any feedback on this issue and would be happy to create a Jira issue/PR if this change looks sane to the community. Thanks in advance.

-- 
*Anton Kirillov*
Senior Software Engineer, Mesosphere
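P.S. For context, here is a sketch of the driver-side workaround that exists today with *spark.metrics.namespace* (the app name "my_app" and the resulting metric names are illustrative; the executor ID part has no equivalent knob, which is exactly the gap described above):

```properties
# spark-defaults.conf (sketch): substitute the stable application name
# for the per-run application ID in metric names.
spark.metrics.namespace=${spark.app.name}

# Assuming an application named "my_app", metrics would then look like:
#   my_app_driver_jvm_heap_used                  <- stable across runs
#   my_app_<spark.executor.id>_jvm_heap_used     <- still unique per executor
```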