Damon Cortesi created SPARK-51613:
-------------------------------------

             Summary: Improve Spark Operator metrics
                 Key: SPARK-51613
                 URL: https://issues.apache.org/jira/browse/SPARK-51613
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes
    Affects Versions: kubernetes-operator-0.1.0
            Reporter: Damon Cortesi
Today the Spark Operator provides JVM, Kubernetes, and Java Operator SDK metrics, but no metrics specific to the functionality and health of the Spark App or Cluster resources managed by the operator. It would be nice to have metrics like:

* Total counts of Apps or Clusters by state (Submitted, Failed, Successful, etc.)
* Gauges of Apps or Clusters by state (Submitted, Pending, Running, etc.)
* Timers for Spark submit latency (e.g. Submission --> Running)
* Potentially the depth of the reconciliation backlog and how many apps are added per interval, although this may already be covered by the operator SDK metrics via reconciliations_queue_size

In addition, it would be nice to have Prometheus metrics with labels, but Dropwizard doesn't appear to support that (nor is it likely to, per [https://github.com/dropwizard/metrics/issues/1272]).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
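To make the counter/gauge/timer distinction concrete, here is a minimal, dependency-free Java sketch of the proposed per-state metrics. It is purely illustrative: the class, enum values, and method names are hypothetical, not taken from the operator's codebase, and a real implementation would presumably register these through the operator's existing Dropwizard MetricRegistry rather than hand-rolled maps.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: tracks total counts (counter-style), current counts
// (gauge-style), and Submission --> Running latency (timer-style) per app state.
public class AppStateMetrics {
    public enum State { SUBMITTED, PENDING, RUNNING, SUCCEEDED, FAILED }

    // Monotonic totals per state, i.e. "how many apps ever entered this state".
    private final Map<State, AtomicLong> totals = new ConcurrentHashMap<>();
    // Current number of apps sitting in each state.
    private final Map<State, AtomicLong> current = new ConcurrentHashMap<>();
    // Submission timestamps, kept so submit latency can be computed later.
    private final Map<String, Long> submittedAt = new ConcurrentHashMap<>();

    // Record an app moving between states (from == null for the first state).
    public void transition(String appId, State from, State to, long nowMillis) {
        totals.computeIfAbsent(to, s -> new AtomicLong()).incrementAndGet();
        if (from != null) {
            current.computeIfAbsent(from, s -> new AtomicLong()).decrementAndGet();
        }
        current.computeIfAbsent(to, s -> new AtomicLong()).incrementAndGet();
        if (to == State.SUBMITTED) {
            submittedAt.put(appId, nowMillis);
        }
    }

    // Submission --> Running latency in millis; -1 if no submission was seen.
    public long submitLatencyMillis(String appId, long runningAtMillis) {
        Long start = submittedAt.remove(appId);
        return start == null ? -1L : runningAtMillis - start;
    }

    public long total(State s) {
        AtomicLong v = totals.get(s);
        return v == null ? 0L : v.get();
    }

    public long gauge(State s) {
        AtomicLong v = current.get(s);
        return v == null ? 0L : v.get();
    }
}
```

With Dropwizard, the same shapes would map to `registry.counter(...)`, a registered `Gauge<Long>` backed by the current-count map, and `registry.timer(...)` for the submit latency, but without label support each state name would have to be baked into the metric name.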