Mridul, I really just wanted to understand Dongjoon's concern; what you're pointing at is slightly different. So what I see is the following:
> [...] they can initialize a SparkContext and work with the RDD API:

The current PR uses a potentially optional value without checking that it
is set, which is exactly what would break if you have only a SparkContext
and no SparkSession (a short sketch at the bottom of this mail, below the
quoted thread, illustrates this). I understand that this can happen when
someone creates a Spark job and uses no other Spark APIs to begin with.
But in the context of the current Spark ML implementation, is it actually
possible to end up in this situation? I'm really just trying to understand
the system's invariants.

> [...] SparkSession is heavier than SparkContext

Assuming that, for whatever reason, a SparkSession was created: is there a
downside to using it?

Please see my questions as independent of the RDD API discussion itself; I
don't think this PR was even meant to be put in the context of any Spark
Connect work.

On Fri, Jul 12, 2024 at 11:58 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>
> It is not necessary for users to create a SparkSession, Martin - they can
> initialize a SparkContext and work with the RDD API, which would be what
> Dongjoon is referring to, IMO.
>
> Even after Spark Connect GA, I am not in favor of deprecating the RDD API
> at least until we have parity between the two (which we don't have today),
> and we have vetted this parity over the course of a few minor releases.
>
>
> Regards,
> Mridul
>
>
>
> On Fri, Jul 12, 2024 at 4:19 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> Apache Spark's RDD API has played an essential and invaluable role from
>> the beginning, and it will continue to do so even if it's not supported
>> by Spark Connect.
>>
>> I have a concern about recent activity that blindly replaces RDD usage
>> with SparkSession.
>>
>> For instance,
>>
>> https://github.com/apache/spark/pull/47328
>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>> Dataframe read / write API
>>
>> This PR doesn't look proper to me in two ways.
>> - SparkSession is heavier than SparkContext
>> - According to the following PR description, the background is also
>> hidden from the community.
>>
>> > # Why are the changes needed?
>> > In databricks runtime, RDD read / write API has some issue for
>> certain storage types
>> > that requires the account key, but Dataframe read / write API works.
>>
>> In addition, we don't know whether this PR actually fixes the mentioned
>> storage issue, because it's not testable within the community's test
>> coverage.
>>
>> I'm wondering if the Apache Spark community aims to move away from RDD
>> usage in favor of `Spark Connect`. Isn't it too early, given that `Spark
>> Connect` is not even GA in the community?
>>
>>
>> Dongjoon.
>>
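
P.S. For concreteness, here is a minimal, self-contained sketch (my own
illustration, not code from the PR) of the SparkContext-only situation
Mridul describes. With no SparkSession ever instantiated, both
SparkSession.getActiveSession and SparkSession.getDefaultSession return
None, so any code that unconditionally calls .get on them will throw:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SparkSession

  object ContextOnlyJob {
    def main(args: Array[String]): Unit = {
      // Create a bare SparkContext, as an RDD-only job would.
      val sc = new SparkContext(
        new SparkConf().setAppName("rdd-only").setMaster("local[*]"))

      // The RDD API works fine with just the context.
      println(sc.parallelize(1 to 10).reduce(_ + _)) // 55

      // No SparkSession was ever created, so both lookups return None;
      // calling SparkSession.getActiveSession.get here would throw.
      println(SparkSession.getActiveSession)  // None
      println(SparkSession.getDefaultSession) // None

      sc.stop()
    }
  }

And the converse of my second question: if a session is needed after the
fact, SparkSession.builder().getOrCreate() reuses the already-running
SparkContext rather than starting a second one, which is why I'm asking
whether there is a real downside to using a SparkSession once one exists.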