I took a quick look at the PR and would like to understand your concern
better about:

>  SparkSession is heavier than SparkContext

It looks like the PR uses the active SparkSession rather than creating a new
one. I would really appreciate it if you could help me understand this
concern better.
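
To make the question concrete, here is a minimal sketch (not the actual PR
code; the path and data are hypothetical) of what I understand the pattern
to be: `getOrCreate()` returns the already-active session, and that session
wraps the existing SparkContext rather than starting a new one.

```scala
import org.apache.spark.sql.SparkSession

// Returns the active SparkSession if one exists; does not start a new context.
val spark = SparkSession.builder().getOrCreate()

// The session wraps the same underlying SparkContext.
val sc = spark.sparkContext

// RDD-based write (the old path), e.g.:
//   sc.parallelize(Seq("a", "b")).saveAsTextFile("/tmp/out-rdd")
// DataFrame-based write through the same active session, e.g.:
//   import spark.implicits._
//   Seq("a", "b").toDS().write.text("/tmp/out-df")
```

If that reading is right, the change reuses the existing context either way,
which is why I'd like to understand where the extra weight comes from.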

Thanks a lot!

On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Hi, All.
>
> Apache Spark's RDD API has played an essential and invaluable role from the
> beginning, and it will continue to do so even though it is not supported by
> Spark Connect.
>
> I have a concern about recent activity that blindly replaces RDD usage with
> SparkSession.
>
> For instance,
>
> https://github.com/apache/spark/pull/47328
> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
> Dataframe read / write API
>
> This PR does not look proper to me for two reasons:
> - SparkSession is heavier than SparkContext
> - According to the following PR description, the background is also hidden
> from the community.
>
>   > # Why are the changes needed?
>   > In databricks runtime, RDD read / write API has some issue for certain
> storage types
>   > that requires the account key, but Dataframe read / write API works.
>
> In addition, we cannot verify whether this PR actually fixes the mentioned
> issue with the unnamed storage type, because it is not reproducible within
> the community's test coverage.
>
> I'm wondering whether the Apache Spark community aims to move away from RDD
> usage in favor of `Spark Connect`. Isn't it too early, given that `Spark
> Connect` is not even GA in the community?
>
> Dongjoon.
>
