Kimahriman commented on PR #50612: URL: https://github.com/apache/spark/pull/50612#issuecomment-2811589454
> @Kimahriman Thanks for submitting this change. I just took a quick look. Please can you share more on the motivation for this and your use case? I would like to understand the issue you observed, the type of stateful query you ran, the state store provider you used and your cluster setup.

There's a little more information in [the jira issue](https://issues.apache.org/jira/browse/SPARK-51823). The quick background is that we run relatively large streaming deduplications and streaming aggregations (total state size can be in the 10s to 100s of TiB) with up to 10s of thousands of partitions. We've been dealing with issues related to this for a long time, and fixes have come out over time that make the situation better, but at the end of the day they are mostly band-aids for this type of scenario. We use the RocksDB state store for most things, with bounded memory to limit resource utilization.

This change is the result of finally digging into why some of our partitions were taking over an hour to create a RocksDB snapshot to upload. That investigation turned up several things potentially contributing to it:

- The level-0 cache is pinned for all opened RocksDB instances on an executor. There can easily be several hundred instances on a single executor, and all that memory can't be freed even when those instances are not being used. This could be fixed by not pinning the level-0 cache.
- There seemed to be contention for background compaction: we would see the checkpoint process start, then nothing happen for that partition for an hour, and then finally compaction would kick in and the checkpoint would be successfully created. This could be improved by increasing the number of background threads.

But at the end of the day these are all workarounds for the real problem: the existing stateful streaming approach doesn't work well with high-latency, high-volume queries; it's designed more around low-latency, low-volume queries. Additionally, we use a dynamic allocation setup, so it is very likely most of our executors will be deallocated before the next batch runs, and keeping the state stores open does nothing but waste resources.

This change would also give the HDFSBackedStateStore more use cases again and help some people avoid the added complexity of adopting RocksDB just to cope with all the state stores being kept on a small number of executors.
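For concreteness, here is a minimal, self-contained sketch of the two RocksDB knobs mentioned in the list above (level-0 filter/index-block pinning and the background job count), written against the RocksJava API directly rather than Spark's actual state store wiring; the cache size, database path, and job count are illustrative assumptions, not values Spark uses.

```scala
import org.rocksdb.{BlockBasedTableConfig, LRUCache, Options, RocksDB}

object RocksDbTuningSketch {
  def main(args: Array[String]): Unit = {
    RocksDB.loadLibrary()

    // Shared block cache, as used when bounded memory is enabled. Pinning L0 filter/index
    // blocks keeps them resident in this cache for every open instance; leaving them
    // unpinned lets that memory be evicted when an instance sits idle.
    val sharedCache = new LRUCache(512L * 1024 * 1024) // illustrative 512 MiB
    val tableConfig = new BlockBasedTableConfig()
      .setBlockCache(sharedCache)
      .setCacheIndexAndFilterBlocks(true)
      .setPinL0FilterAndIndexBlocksInCache(false)

    val options = new Options()
      .setCreateIfMissing(true)
      .setTableFormatConfig(tableConfig)
      // max_background_jobs is the consolidated knob covering flush and compaction threads;
      // raising it reduces the chance a snapshot waits behind a compaction backlog.
      .setMaxBackgroundJobs(4)

    // Hypothetical local path, only to make the sketch runnable end to end.
    val db = RocksDB.open(options, "/tmp/rocksdb-tuning-sketch")
    db.put("key".getBytes, "value".getBytes)
    db.close()
    options.close()
  }
}
```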