I took a quick look at the PR and would like to understand your concern better about:
> SparkSession is heavier than SparkContext

It looks like the PR is using the active SparkSession, not creating a new one.
I would highly appreciate it if you could help me understand this situation
better. Thanks a lot!

On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Hi, All.
>
> Apache Spark's RDD API has played an essential and invaluable role from the
> beginning, and it will continue to even if it's not supported by Spark
> Connect.
>
> I have a concern about recent activity that replaces RDD with
> SparkSession blindly.
>
> For instance:
>
> https://github.com/apache/spark/pull/47328
> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
> Dataframe read / write API
>
> This PR doesn't look proper to me in two ways:
> - SparkSession is heavier than SparkContext.
> - According to the following PR description, the background is also hidden
> from the community.
>
> > # Why are the changes needed?
> > In Databricks Runtime, the RDD read / write API has some issue for
> > certain storage types that require the account key, but the Dataframe
> > read / write API works.
>
> In addition, we don't know whether this PR fixes the mentioned unknown
> storage's issue, because it's not testable in the community test coverage.
>
> I'm wondering if the Apache Spark community aims to move away from RDD
> usage in favor of `Spark Connect`. Isn't it too early, given that `Spark
> Connect` is not even GA in the community?
>
> Dongjoon.