So looking at the PR, it does not appear to remove any RDD APIs, but the justification provided for changing the ML backend to use the DataFrame APIs is indeed concerning.
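For anyone skimming the thread, the substitution at issue is roughly the following. This is a minimal Scala sketch, not the actual diff from SPARK-48883; the paths and variable names are placeholders, and the assumption that the metadata is plain text I/O is mine:

    import org.apache.spark.sql.SparkSession

    // Illustrative only; paths are placeholders.
    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext

    // Before: RDD read / write via SparkContext
    val rdd = sc.textFile("/path/to/metadata")
    rdd.saveAsTextFile("/path/to/output")

    // After: DataFrame read / write via SparkSession
    val df = spark.read.text("/path/to/metadata")
    df.write.text("/path/to/output")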
This PR appears to have been merged without proper review (or providing an opportunity for review). I'd like to remind people of the expectations we decided on together: https://spark.apache.org/committers.html

I don't find the provided justification for the change convincing and would ask that we revert this PR so that a proper discussion can take place:

"In databricks runtime, RDD read / write API has some issue for certain storage types that requires the account key, but Dataframe read / write API works."

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

On Fri, Jul 12, 2024 at 1:02 PM Martin Grund <mar...@databricks.com.invalid> wrote:

> I took a quick look at the PR and would like to understand your concern
> better about:
>
> > SparkSession is heavier than SparkContext
>
> It looks like the PR is using the active SparkSession, not creating a
> new one. I would highly appreciate it if you could help me understand
> this situation better.
>
> Thanks a lot!
>
> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> Hi, All.
>>
>> Apache Spark's RDD API has played an essential and invaluable role from
>> the beginning, and it will continue to do so even if it's not supported
>> by Spark Connect.
>>
>> I have a concern about recent activity that blindly replaces RDD usage
>> with SparkSession.
>>
>> For instance:
>>
>> https://github.com/apache/spark/pull/47328
>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>> Dataframe read / write API
>>
>> This PR doesn't look proper to me in two ways:
>> - SparkSession is heavier than SparkContext
>> - According to the following PR description, the background is also
>> hidden from the community.
>>
>> > # Why are the changes needed?
>> > In databricks runtime, RDD read / write API has some issue for
>> > certain storage types that requires the account key, but Dataframe
>> > read / write API works.
>>
>> In addition, we don't know whether this PR actually fixes the mentioned
>> storage issue, because it's not testable within the community's test
>> coverage.
>>
>> I'm wondering if the Apache Spark community aims to move away from RDD
>> usage in favor of `Spark Connect`. Isn't it too early, given that
>> `Spark Connect` is not even GA in the community?
>>
>> Dongjoon.
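To make the "SparkSession is heavier than SparkContext" point concrete for readers of the archive: a SparkSession wraps the JVM-wide singleton SparkContext together with SQL, catalog, and configuration state, and reusing the already-active session (as Martin notes the PR does) creates no new context. A minimal sketch, assuming only the standard SparkSession API:

    import org.apache.spark.sql.SparkSession

    // Reuse the active session if one exists; otherwise fall back to
    // the builder, which itself returns any existing session rather
    // than constructing a second one.
    val spark = SparkSession.getActiveSession
      .getOrElse(SparkSession.builder().getOrCreate())

    // The underlying SparkContext is shared; the session layers SQL
    // parsing, catalog, and per-session configuration on top of it.
    val sc = spark.sparkContext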