We actually get the active SparkSession, so it doesn't add overhead. And even if we do have to create one, it is created only once, which should be pretty trivial overhead.
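For illustration, a minimal Scala sketch of the lookup-then-create pattern described above (the helper name `sessionFor` is hypothetical; `getActiveSession` and `getOrCreate()` are public Spark API):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession

    // Hypothetical helper: reuse the active session if one exists;
    // otherwise fall back to getOrCreate(), which builds the session at
    // most once and reuses the existing SparkContext under the hood.
    def sessionFor(sc: SparkContext): SparkSession =
      SparkSession.getActiveSession.getOrElse(
        SparkSession.builder().config(sc.getConf).getOrCreate()
      )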
I don't think we can deprecate the RDD API, IMHO, in any event.

On Sat, Jul 13, 2024 at 1:30 PM Martin Grund <mar...@databricks.com.invalid> wrote:

> Mridul, I really just wanted to understand the concern from Dongjoon.
> What you're pointing at is a slightly different concern. What I see is
> the following:
>
> > [...] they can initialize a SparkContext and work with RDD api:
>
> The current PR uses a potentially optional value without checking that
> it is set (which is what would happen if you have only a SparkContext
> and no SparkSession).
>
> I understand that this can happen when someone creates a Spark job and
> uses no other Spark APIs to begin with. But in the context of the
> current Spark ML implementation, is it actually possible to end up in
> this situation? I'm really just trying to understand the system's
> invariants.
>
> > [...] SparkSession is heavier than SparkContext
>
> Assuming that, for whatever reason, a SparkSession was created, is
> there a downside to using it?
>
> Please see my questions as independent of the RDD API discussion
> itself; I don't think this PR was even meant to be put in the context
> of any Spark Connect work.
>
> On Fri, Jul 12, 2024 at 11:58 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>
>> It is not necessary for users to create a SparkSession, Martin - they
>> can initialize a SparkContext and work with the RDD API, which is what
>> Dongjoon is referring to, IMO.
>>
>> Even after Spark Connect goes GA, I am not in favor of deprecating the
>> RDD API, at least until we have parity between the two (which we don't
>> have today) and have vetted that parity over the course of a few minor
>> releases.
>>
>> Regards,
>> Mridul
>>
>> On Fri, Jul 12, 2024 at 4:19 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> Hi, All.
>>>
>>> Apache Spark's RDD API has played an essential and invaluable role
>>> from the beginning, and it will continue to do so even if it's not
>>> supported by Spark Connect.
>>>
>>> I have a concern about recent activity that blindly replaces RDD
>>> usage with SparkSession.
>>>
>>> For instance:
>>>
>>> https://github.com/apache/spark/pull/47328
>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>> Dataframe read / write API
>>>
>>> This PR doesn't look right to me for two reasons:
>>> - SparkSession is heavier than SparkContext.
>>> - According to the PR description below, the motivation is also
>>> hidden from the community.
>>>
>>> > # Why are the changes needed?
>>> > In Databricks Runtime, the RDD read / write API has issues with
>>> > certain storage types that require the account key, but the
>>> > DataFrame read / write API works.
>>>
>>> In addition, we don't know whether this PR actually fixes the
>>> mentioned storage issue, because it isn't testable within the
>>> community's test coverage.
>>>
>>> I'm wondering whether the Apache Spark community aims to move away
>>> from RDD usage in favor of `Spark Connect`. Isn't it too early, given
>>> that `Spark Connect` is not even GA in the community?
>>>
>>> Dongjoon.
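For reference, a minimal Scala sketch contrasting the two I/O paths debated above (the paths and sample data are made up for illustration; the actual PR changes ML / R model persistence internals, not user-facing code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("io-sketch").getOrCreate()
    val sc = spark.sparkContext

    // RDD read / write API -- the path SPARK-48883 replaces.
    sc.parallelize(Seq("a", "b"), numSlices = 1).saveAsTextFile("/tmp/out-rdd")
    val viaRdd = sc.textFile("/tmp/out-rdd").collect()

    // DataFrame read / write API -- the replacement.
    import spark.implicits._
    Seq("a", "b").toDF("value").coalesce(1).write.text("/tmp/out-df")
    val viaDf = spark.read.text("/tmp/out-df").collect()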