I would like to provide some new information:

1. Spark 3.4.0 [SPARK-42277] started using RocksDB as the default for
`spark.history.store.hybridStore.diskBackend`.
   Since Spark 3.4, Spark uses a RocksDB store when
`spark.history.store.hybridStore.enabled` is true. To restore the behavior
before Spark 3.4, set `spark.history.store.hybridStore.diskBackend` to
`LEVELDB`.

2. Spark 4.0.0 [SPARK-45351] started using RocksDB as the default for
`spark.shuffle.service.db.backend`.
   Since Spark 4.0, `spark.shuffle.service.db.backend` defaults to `ROCKSDB`,
which means Spark uses a RocksDB store for the shuffle service. To restore
the behavior before Spark 4.0, set `spark.shuffle.service.db.backend` to
`LEVELDB`.

So for users who had not explicitly set the options above to `LEVELDB`, the
data-reconstruction / re-parsing situation discussed in this thread has
already been happening.
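For anyone who prefers to keep the old behavior (and the existing LevelDB
files) across those upgrades, a minimal sketch of the relevant settings is
below. This is only an illustration; where each property has to live depends
on the deployment (the History Server reads its own properties file, while
the external shuffle service is typically configured on the node manager
side), so please double-check against your setup.

    # Keep the SHS hybrid store on LevelDB (the pre-3.4 default)
    spark.history.store.hybridStore.diskBackend    LEVELDB

    # Keep the external shuffle service state store on LevelDB (the pre-4.0 default)
    spark.shuffle.service.db.backend               LEVELDB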
On 2025/06/09 01:08:05 Jungtaek Lim wrote:
> Thanks for the valuable input.
>
> I think it's more about the case where upgrading would surprise the end
> users. If we simply remove LevelDB from the next release, we will be
> removing these intermediate data as well and forcing them to rebuild
> everything. 15 mins is probably not super long for the given volume, but
> even a couple of additional minutes could bring a negative sentiment if
> they ever opened this before.
>
> Would enabling the hybrid store reduce the surprise? If so, maybe we could
> ask users to enable it, assigning a bit more memory (+2g on the SHS
> process) if they didn't use the hybrid store before.
>
> On Fri, Jun 6, 2025 at 5:08 PM Cheng Pan <pan3...@gmail.com> wrote:
> >
> > I think SHS only uses LevelDB/RocksDB to store intermediate data, so
> > supporting re-parsing to rebuild the cache should be fine.
> >
> > Also sharing my experience with using LevelDB/RocksDB for SHS: it seems
> > LevelDB has native memory leak issues, at least for the SHS use case. I
> > needed to reboot the SHS every two months to recover it, and the issue
> > was gone after upgrading to Spark 3.3 and switching to RocksDB.
> >
> > Scale and performance: we keep ~800k applications' event logs in the
> > event log HDFS directory; re-parsing with multiple threads to rebuild
> > listing.rdb takes ~15 mins.
> >
> > Thanks,
> > Cheng Pan
> >
> > On Jun 6, 2025, at 15:36, Jungtaek Lim <kabhwan.opensou...@gmail.com>
> > wrote:
> >
> > IMHO, it probably depends on how long the rewrite will take versus
> > re-reading the event log. If loading the state from LevelDB and rewriting
> > to RocksDB is much faster, then we may want to support this for a couple
> > of minor releases so as not to force users to lose their cache. If there
> > is no such difference, it is probably good to gradually migrate them
> > automatically via opt-in for a couple of minor releases. In both cases,
> > we can enforce migration (neither opt-in nor opt-out) after that period.
> >
> > On Fri, Jun 6, 2025 at 10:51 AM Jia Fan <fanjia1...@gmail.com> wrote:
> >
> >> This is indeed an issue at the moment. Personally, I haven't found a
> >> proper way to migrate data from LevelDB to RocksDB, as their storage
> >> structures are different. Should we wait until a reasonable migration
> >> solution becomes available before moving forward with this?
> >>
> >> On Wed, May 28, 2025 at 15:41, Jungtaek Lim
> >> <kabhwan.opensou...@gmail.com> wrote:
> >> >
> >> > Thanks for initiating this.
> >> >
> >> > I wonder if we don't have any compatibility issues in any component -
> >> the SS area does not have an issue, but I don't quite remember whether
> >> the history server would be OK with this. What is the story of the
> >> migration if they had been using LevelDB? I guess it could probably be
> >> re-parsed, but do we need to ask users to perform some manual work to
> >> do that?
> >> >
> >> > On Wed, May 28, 2025 at 2:27 PM Yang Jie <yangji...@apache.org> wrote:
> >> >>
> >> >> The project "org.fusesource.leveldbjni:leveldbjni" released its last
> >> version 12 years ago, and its code repository was last updated 8 years
> >> ago. Consequently, I believe it's challenging for us to receive ongoing
> >> maintenance and support from this project.
> >> >>
> >> >> On the flip side, when developers implement new features related to
> >> Spark code, they have become accustomed to using RocksDB instead of
> >> LevelDB.
> >> >>
> >> >> Furthermore, in Spark 4.0, support for LevelDB was deprecated, and
> >> the default implementation of the corresponding functionality was
> >> switched to RocksDB.
> >> >>
> >> >> Given these factors, I support discontinuing support for LevelDB.
> >> >>
> >> >> Thanks
> >> >> Jie Yang
> >> >>
> >> >> On 2025/05/27 08:26:06 Jia Fan wrote:
> >> >> > Hi all,
> >> >> >
> >> >> > I'd like to start a discussion about removing LevelDB support from
> >> Apache Spark.
> >> >> >
> >> >> > As noted in SPARK-44223
> >> (https://issues.apache.org/jira/browse/SPARK-44223),
> >> >> > LevelDB support was deprecated in Spark 4.0. It's no longer actively
> >> >> > maintained or widely used, and continuing to support it brings
> >> >> > unnecessary maintenance and dependency complexity.
> >> >> >
> >> >> > A PR has been opened here to remove it entirely:
> >> >> > https://github.com/apache/spark/pull/51027
> >> >> >
> >> >> > WDYT?
> >> >> >
> >> >> > Best regards,
> >> >> > Jia Fan

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org