Hi, All.

Apache Spark's RDD API has played an essential and invaluable role since the beginning, and it will continue to do so even though it is not supported by Spark Connect.
I have a concern about recent activity that blindly replaces RDD usage with SparkSession. For instance:

https://github.com/apache/spark/pull/47328
[SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API

This PR does not look proper to me for two reasons:
- SparkSession is heavier than SparkContext.
- According to the following PR description, the background is also hidden from the community.

> # Why are the changes needed?
> In databricks runtime, RDD read / write API has some issue for certain storage types
> that requires the account key, but Dataframe read / write API works.

In addition, we don't know whether this PR actually fixes the issue with the unnamed storage type, because it is not testable within the community's test coverage.

I'm wondering whether the Apache Spark community aims to move away from RDD usage in favor of `Spark Connect`. Isn't it too early, given that `Spark Connect` is not even GA in the community?

Dongjoon.
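
P.S. For anyone who wants to see the kind of substitution concretely, here is a minimal sketch of the two write paths. It is not the actual diff from the PR; the object name, metadata payload, and output paths are hypothetical. The point is that the RDD path needs only a SparkContext, while the DataFrame path goes through the full SparkSession API.

```scala
import org.apache.spark.sql.SparkSession

object RddVsDataFrameWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-vs-df-write").getOrCreate()
    val sc = spark.sparkContext

    // A single JSON metadata string, similar in shape to what ML writers persist.
    val metadata = """{"class":"example.Model","version":"1.0"}"""

    // RDD-based write: needs only the SparkContext.
    sc.parallelize(Seq(metadata), numSlices = 1)
      .saveAsTextFile("/tmp/metadata-rdd") // hypothetical output path

    // DataFrame-based write: the equivalent through the SparkSession API.
    import spark.implicits._
    Seq(metadata).toDF("value")
      .repartition(1)
      .write.text("/tmp/metadata-df") // hypothetical output path

    spark.stop()
  }
}
```

Both snippets persist the same single-partition text payload; the only difference is which entry point (SparkContext vs. SparkSession) the write goes through, which is exactly the trade-off in question above.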