I would like to provide some new information:

1. Spark 3.4.0 [SPARK-42277] started using RocksDB as the default for
`spark.history.store.hybridStore.diskBackend`.
   Since Spark 3.4, Spark uses a RocksDB store when
`spark.history.store.hybridStore.enabled` is true. To restore the behavior
before Spark 3.4, set `spark.history.store.hybridStore.diskBackend` to
`LEVELDB`.

2. Spark 4.0.0 [SPARK-45351] started using RocksDB as the default for
`spark.shuffle.service.db.backend`.
   Since Spark 4.0, `spark.shuffle.service.db.backend` defaults to `ROCKSDB`,
which means Spark uses a RocksDB store for the shuffle service. To restore
the behavior before Spark 4.0, set `spark.shuffle.service.db.backend` to
`LEVELDB`.

So for users who had not explicitly set the options above to `LEVELDB`, the
data-reconstruction / re-parsing situation discussed in this thread has
already been happening.
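For anyone who prefers to keep the old behavior (and the existing LevelDB
files) across those upgrades, a minimal sketch of the relevant settings is
below. This is only an illustration; where each property has to live depends
on the deployment (the History Server reads its own properties file, while
the external shuffle service is typically configured on the node manager
side), so please double-check against your setup.

    # Keep the SHS hybrid store on LevelDB (the pre-3.4 default)
    spark.history.store.hybridStore.diskBackend    LEVELDB

    # Keep the external shuffle service state store on LevelDB (the pre-4.0 default)
    spark.shuffle.service.db.backend               LEVELDB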
On 2025/06/09 01:08:05 Jungtaek Lim wrote:
> Thanks for the valuable input.
>
> I think it's more about the case where upgrading would surprise the end
> users. If we simply remove LevelDB from the next release, we will be
> removing these intermediate data as well and forcing them to rebuild
> everything. 15 mins is probably not super long for the given volume, but
> even a couple of additional minutes could bring a negative sentiment if
> they ever opened this before.
>
> Would enabling the hybrid store reduce the surprise? If so, maybe we could
> ask users to enable it, assigning a bit more memory (+2g on the SHS
> process) if they didn't use the hybrid store before.
>
> On Fri, Jun 6, 2025 at 5:08 PM Cheng Pan <pan3...@gmail.com> wrote:
> >
> > I think SHS only uses LevelDB/RocksDB to store intermediate data, so
> > supporting re-parsing to rebuild the cache should be fine.
> >
> > Also sharing my experience with using LevelDB/RocksDB for SHS: it seems
> > LevelDB has native memory leak issues, at least for the SHS use case. I
> > needed to reboot the SHS every two months to recover it, and the issue
> > was gone after upgrading to Spark 3.3 and switching to RocksDB.
> >
> > Scale and performance: we keep ~800k applications' event logs in the
> > event log HDFS directory; re-parsing with multiple threads to rebuild
> > listing.rdb takes ~15 mins.
> >
> > Thanks,
> > Cheng Pan
> >
> > On Jun 6, 2025, at 15:36, Jungtaek Lim <kabhwan.opensou...@gmail.com>
> > wrote:
> >
> > IMHO, it probably depends on how long the rewrite will take versus
> > re-reading the event log. If loading the state from LevelDB and rewriting
> > to RocksDB is much faster, then we may want to support this for a couple
> > of minor releases so as not to force users to lose their cache. If there
> > is no such difference, it is probably good to gradually migrate them
> > automatically via opt-in for a couple of minor releases. In both cases,
> > we can enforce migration (neither opt-in nor opt-out) after that period.
> >
> > On Fri, Jun 6, 2025 at 10:51 AM Jia Fan <fanjia1...@gmail.com> wrote:
> >
> >> This is indeed an issue at the moment. Personally, I haven't found a
> >> proper way to migrate data from LevelDB to RocksDB, as their storage
> >> structures are different. Should we wait until a reasonable migration
> >> solution becomes available before moving forward with this?
> >>
> >> On Wed, May 28, 2025 at 15:41, Jungtaek Lim
> >> <kabhwan.opensou...@gmail.com> wrote:
> >> >
> >> > Thanks for initiating this.
> >> >
> >> > I wonder if we don't have any compatibility issues in any component -
> >> the SS area does not have an issue, but I don't quite remember whether
> >> the history server would be OK with this. What is the story of the
> >> migration if they had been using LevelDB? I guess it could probably be
> >> re-parsed, but do we need to ask users to perform some manual work to
> >> do that?
> >> >
> >> > On Wed, May 28, 2025 at 2:27 PM Yang Jie <yangji...@apache.org> wrote:
> >> >>
> >> >> The project "org.fusesource.leveldbjni:leveldbjni" released its last
> >> version 12 years ago, and its code repository was last updated 8 years
> >> ago. Consequently, I believe it's challenging for us to receive ongoing
> >> maintenance and support from this project.
> >> >>
> >> >> On the flip side, when developers implement new features related to
> >> Spark code, they have become accustomed to using RocksDB instead of
> >> LevelDB.
> >> >>
> >> >> Furthermore, in Spark 4.0, support for LevelDB was deprecated, and
> >> the default implementation of the corresponding functionality was
> >> switched to RocksDB.
> >> >>
> >> >> Given these factors, I support discontinuing support for LevelDB.
> >> >>
> >> >> Thanks
> >> >> Jie Yang
> >> >>
> >> >> On 2025/05/27 08:26:06 Jia Fan wrote:
> >> >> > Hi all,
> >> >> >
> >> >> > I'd like to start a discussion about removing LevelDB support from
> >> Apache Spark.
> >> >> >
> >> >> > As noted in SPARK-44223
> >> (https://issues.apache.org/jira/browse/SPARK-44223),
> >> >> > LevelDB support was deprecated in Spark 4.0. It's no longer actively
> >> >> > maintained or widely used, and continuing to support it brings
> >> >> > unnecessary maintenance and dependency complexity.
> >> >> >
> >> >> > A PR has been opened here to remove it entirely:
> >> >> > https://github.com/apache/spark/pull/51027
> >> >> >
> >> >> > WDYT?
> >> >> >
> >> >> > Best regards,
> >> >> > Jia Fan

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org