[ https://issues.apache.org/jira/browse/SPARK-51097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zeyu Chen updated SPARK-51097:
------------------------------
    Description:
We currently lack detailed visibility into state store level state maintenance in RocksDB. This limitation makes it harder to identify performance degradation in maintenance tasks.

To remediate this, we will introduce state store "instance" metrics to StreamingQueryProgress to track the latest snapshot version uploaded in RocksDB.

This improvement addresses three challenges in observability:
 * Detecting uneven partition starvation, by identifying partitions with slow state maintenance,
 * Finding missing snapshots across versions, so that we minimize extensive replays during recovery,
 * Identifying performance instability, for example by gaining insight into snapshot upload patterns.

  was:
We currently lack detailed visibility into partition-level state maintenance in RocksDB. This limitation makes it harder to identify performance degradation in maintenance tasks.

To remediate this, we will add partition-level metrics to StreamingQueryProgress to track the latest snapshot version uploaded in RocksDB.

This improvement addresses three challenges in observability:
 * Detecting uneven partition starvation, by identifying partitions with slow state maintenance,
 * Finding missing snapshots across versions, so that we minimize extensive replays during recovery,
 * Identifying performance instability, for example by gaining insight into snapshot upload patterns.


> Adding state store level metrics for last uploaded snapshot version in RocksDB
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-51097
>                 URL: https://issues.apache.org/jira/browse/SPARK-51097
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 4.0.0, 4.1.0
>            Reporter: Zeyu Chen
>            Priority: Minor
>              Labels: pull-request-available
>
> We currently lack detailed visibility into state store level state
> maintenance in RocksDB. This limitation makes it harder to identify
> performance degradation in maintenance tasks.
> To remediate this, we will introduce state store "instance" metrics to
> StreamingQueryProgress to track the latest snapshot version uploaded in
> RocksDB.
> This improvement addresses three challenges in observability:
>  * Detecting uneven partition starvation, by identifying partitions with
> slow state maintenance,
>  * Finding missing snapshots across versions, so that we minimize extensive
> replays during recovery,
>  * Identifying performance instability, for example by gaining insight into
> snapshot upload patterns.
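
As a rough illustration of how the proposed metric could be consumed, the sketch below flags partitions whose last uploaded snapshot trails the current state version. It is not the implementation from this issue and does not use any Spark API: the function name `lagging_partitions`, the plain `dict` of partition id to snapshot version, the `max_lag` threshold, and the use of a negative value to mean "no snapshot uploaded yet" are all assumptions made for the example.

```python
from typing import Dict, List


def lagging_partitions(
    last_uploaded: Dict[int, int],
    current_version: int,
    max_lag: int,
) -> List[int]:
    """Return the ids of partitions whose snapshot upload is falling behind.

    last_uploaded maps a partition id to the last snapshot version that
    partition uploaded. A negative value is treated here as "nothing
    uploaded yet" (a convention assumed for this sketch only).
    """
    lagging = []
    for partition_id, snapshot_version in sorted(last_uploaded.items()):
        never_uploaded = snapshot_version < 0
        too_far_behind = current_version - snapshot_version > max_lag
        if never_uploaded or too_far_behind:
            lagging.append(partition_id)
    return lagging


if __name__ == "__main__":
    # Hypothetical values, as they might be collected from one progress update.
    metrics = {0: 98, 1: 40, 2: -1, 3: 100}
    print(lagging_partitions(metrics, current_version=100, max_lag=10))
```

With the hypothetical values above, partitions 1 and 2 are reported: the first is 60 versions behind and the second has never uploaded a snapshot. Those are the partitions that would need the longest replay during recovery.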