Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

Hyukjin Kwon Fri, 12 Jul 2024 20:23:19 -0700

I made a followup (https://github.com/apache/spark/pull/47341) to address
my own concerns. Please let me know if there are additional concerns. We
could further discuss it there.
I am also fine with reverting it and starting it from scratch if that's
preferred.



On Sat, 13 Jul 2024 at 11:52, Ruifeng Zheng <[email protected]> wrote:

> My bad, as the reviewer, I should review the PR description more closely.
>
> I think it is a good change to replace spark context based implementation
> with spark session, and if I recall correctly there were some similar
> attempts in MLLib in the past.
>
>
> On Sat, Jul 13, 2024 at 9:42 AM Hyukjin Kwon <[email protected]> wrote:
>
>> I think we should have not mentioned a specific vendor there. The change
>> also shouldn't repartition. We should create a partition 1.
>>
>> But in general leveraging Catalyst optimizer and SQL engine there is a
>> good idea as we can leverage all optimization there. For example, it will
>> use UTF8 encoding instead of a plan string ser/de. We made similar changes
>> in JSON and CSV schema inference (it was an RDD before)
>>
>> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <[email protected]>
>> wrote:
>>
>>> My bad I meant to say I believe the provided justification is
>>> inappropriate.
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <[email protected]>
>>> wrote:
>>>
>>>> So looking at the PR it does not appear to be removing any RDD APIs but
>>>> the justification provided for changing the ML backend to use the DataFrame
>>>> APIs is indeed concerning.
>>>>
>>>> This PR appears to have been merged without proper review (or providing
>>>> an opportunity for review).
>>>>
>>>> I’d like to remind people of the expectations we decided on together —
>>>> https://spark.apache.org/committers.html
>>>>
>>>> I believe the provided justification for the change and would ask that
>>>> we revert this PR so that a proper discussion can take place.
>>>>
>>>> “
>>>> In databricks runtime, RDD read / write API has some issue for certain
>>>> storage types that requires the account key, but Dataframe read / write API
>>>> works.
>>>> “
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>>>
>>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
>>>> <[email protected]> wrote:
>>>>
>>>>> I took a quick look at the PR and would like to understand your
>>>>> concern better about:
>>>>>
>>>>> >  SparkSession is heavier than SparkContext
>>>>>
>>>>> It looks like the PR is using the active SparkSession, not creating a
>>>>> new one etc. I would highly appreciate it if you could help me understand
>>>>> this situation better.
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> Apache Spark's RDD API plays an essential and invaluable role from
>>>>>> the beginning and it will be even if it's not supported by Spark Connect.
>>>>>>
>>>>>> I have a concern about a recent activity which replaces RDD with
>>>>>> SparkSession blindly.
>>>>>>
>>>>>> For instance,
>>>>>>
>>>>>> https://github.com/apache/spark/pull/47328
>>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>>>> Dataframe read / write API
>>>>>>
>>>>>> This PR doesn't look proper to me in two ways.
>>>>>> - SparkSession is heavier than SparkContext
>>>>>> - According to the following PR description, the background is also
>>>>>> hidden in the community.
>>>>>>
>>>>>>   > # Why are the changes needed?
>>>>>>   > In databricks runtime, RDD read / write API has some issue for
>>>>>> certain storage types
>>>>>>   > that requires the account key, but Dataframe read / write API
>>>>>> works.
>>>>>>
>>>>>> In addition, we don't know if this PR fixes the mentioned unknown
>>>>>> storage's issue or not because it's not testable in the community test
>>>>>> coverage.
>>>>>>
>>>>>> I'm wondering if the Apache Spark community aims to move away from
>>>>>> the RDD usage in favor of `Spark Connect`. Isn't it too early because
>>>>>> `Spark Connect` is not even GA in the community?
>>>>>>
>>>>>> Dongjoon.
>>>>>>
>>>>>
>
> --
> Ruifeng Zheng
> E-mail: [email protected]
>

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

Reply via email to