Hi, All.

Apache Spark's RDD API has played an essential and invaluable role from the
beginning, and it will continue to do so even though it is not supported by
Spark Connect.

I have a concern about recent activity that blindly replaces RDD API usage
with SparkSession-based APIs.

For instance,

https://github.com/apache/spark/pull/47328
[SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe
read / write API

This PR doesn't look proper to me for two reasons.
- SparkSession is heavier than SparkContext (see the sketch below).
- According to the following PR description, the underlying motivation is
also hidden from the community.

  > # Why are the changes needed?
  > In databricks runtime, RDD read / write API has some issue for certain
  > storage types that requires the account key, but Dataframe read / write
  > API works.
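
For context, here is a minimal sketch of the kind of replacement in
question (the object name, paths, and metadata payload below are
hypothetical placeholders, not taken from the PR). The first write needs
only a SparkContext and goes straight through the Hadoop OutputFormat
machinery; the second routes the same payload through SparkSession's
session state and query planning, which is what I mean by "heavier".

```scala
// A minimal sketch, not code from the PR: paths and the metadata
// payload are hypothetical placeholders.
import org.apache.spark.sql.SparkSession

object RddVsDataFrameWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataframe-write")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    val metadataJson = """{"class":"org.example.SomeModel","version":"1.0"}"""

    // RDD write path: a single-partition RDD saved directly through the
    // Hadoop OutputFormat machinery; only a SparkContext is required.
    sc.parallelize(Seq(metadataJson), numSlices = 1)
      .saveAsTextFile("/tmp/model-metadata-rdd")

    // DataFrame write path: the same payload routed through SparkSession,
    // i.e. session state, the analyzer, and the data source write path.
    import spark.implicits._
    Seq(metadataJson).toDF("value")
      .repartition(1)
      .write.text("/tmp/model-metadata-df")

    spark.stop()
  }
}
```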

In addition, we don't know whether this PR actually fixes the mentioned
storage issue, because that storage type is not covered by the community's
test coverage.

I'm wondering if the Apache Spark community aims to move away from RDD
usage in favor of `Spark Connect`. Isn't it too early, given that `Spark
Connect` is not even GA in the community?

Dongjoon.
