Damon Cortesi created SPARK-51613:
-------------------------------------

             Summary: Improve Spark Operator metrics
                 Key: SPARK-51613
                 URL: https://issues.apache.org/jira/browse/SPARK-51613
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes
    Affects Versions: kubernetes-operator-0.1.0
            Reporter: Damon Cortesi


Today the Spark Operator provides JVM, Kubernetes, and Java Operator SDK 
metrics, but no metrics specific to the functionality and health of the Spark 
App or Cluster resources managed by the operator. It would be nice to have 
metrics like:
 * Total counts of Apps or Clusters by state (Submitted, Failed, Successful, 
etc.)
 * Gauges of Apps or Clusters by state (Submitted, Pending, Running, etc)
 * Timers for Spark submit latency (for example, Submitted --> Running)
 * Potentially the depth of the reconciliation backlog and the number of apps 
added per interval, although this may already be covered by the operator SDK's 
reconciliations_queue_size metric
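As a rough illustration of the counter/gauge/timer ideas above, a stdlib-only Java sketch could track monotonic totals and current counts per state and derive submit latency from state transitions. All class, method, and state names here are hypothetical, not the operator's actual API; a real implementation would register these with the operator's Dropwizard MetricRegistry instead of plain adders:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of the proposed app metrics: per-state totals
// (counters), per-state current counts (gauges), and a submit-latency
// measurement taken on the Submitted -> Running transition.
public class AppMetricsSketch {
    public enum AppState { SUBMITTED, PENDING, RUNNING, SUCCEEDED, FAILED }

    // Monotonic totals: how many apps have ever entered each state.
    private final Map<AppState, LongAdder> totals = new EnumMap<>(AppState.class);
    // Gauges: how many apps are currently in each state.
    private final Map<AppState, LongAdder> current = new EnumMap<>(AppState.class);
    // Submission timestamps per app, used to derive submit latency.
    private final Map<String, Long> submittedAtNanos = new ConcurrentHashMap<>();

    public AppMetricsSketch() {
        for (AppState s : AppState.values()) {
            totals.put(s, new LongAdder());
            current.put(s, new LongAdder());
        }
    }

    // Record an app transitioning between states (from == null on creation).
    public void transition(String appId, AppState from, AppState to) {
        if (from != null) current.get(from).decrement();
        current.get(to).increment();
        totals.get(to).increment();
        if (to == AppState.SUBMITTED) {
            submittedAtNanos.put(appId, System.nanoTime());
        } else if (to == AppState.RUNNING) {
            Long start = submittedAtNanos.remove(appId);
            if (start != null) recordSubmitLatencyNanos(System.nanoTime() - start);
        }
    }

    private void recordSubmitLatencyNanos(long nanos) {
        // A real implementation would feed a Dropwizard Timer here.
    }

    public long total(AppState s) { return totals.get(s).sum(); }
    public long gauge(AppState s) { return current.get(s).sum(); }
}
```

The state-transition hook is the natural integration point, since the operator's reconciler already observes every status change of the custom resources.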

In addition, it would be nice to have Prometheus metrics with labels, but it 
doesn't look like Dropwizard supports that (and it seems unlikely to, per 
[https://github.com/dropwizard/metrics/issues/1272] ).
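To illustrate the difference (metric names below are hypothetical): with Prometheus labels, one metric name covers every state, whereas Dropwizard's flat namespace forces a separate name per state:

```
# Labeled (Prometheus exposition format): one name, one series per state
spark_operator_app_count{state="running"} 4
spark_operator_app_count{state="failed"} 1

# Flat (Dropwizard-style): a distinct metric name per state
spark.operator.app.count.running 4
spark.operator.app.count.failed 1
```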



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
