My bad, I meant to say that I believe the provided justification is inappropriate.
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:

> So looking at the PR, it does not appear to be removing any RDD APIs, but
> the justification provided for changing the ML backend to use the
> DataFrame APIs is indeed concerning.
>
> This PR appears to have been merged without proper review (or providing
> an opportunity for review).
>
> I’d like to remind people of the expectations we decided on together —
> https://spark.apache.org/committers.html
>
> I believe the provided justification for the change and would ask that we
> revert this PR so that a proper discussion can take place.
>
> “
> In databricks runtime, RDD read / write API has some issue for certain
> storage types that requires the account key, but Dataframe read / write
> API works.
> “
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
> <mar...@databricks.com.invalid> wrote:
>
>> I took a quick look at the PR and would like to understand your concern
>> better about:
>>
>> > SparkSession is heavier than SparkContext
>>
>> It looks like the PR is using the active SparkSession, not creating a
>> new one, etc. I would highly appreciate it if you could help me
>> understand this situation better.
>>
>> Thanks a lot!
>>
>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, All.
>>>
>>> Apache Spark's RDD API has played an essential and invaluable role from
>>> the beginning, and it will continue to do so even if it's not supported
>>> by Spark Connect.
>>>
>>> I have a concern about a recent activity which blindly replaces RDD
>>> usage with SparkSession.
>>>
>>> For instance,
>>>
>>> https://github.com/apache/spark/pull/47328
>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>> Dataframe read / write API
>>>
>>> This PR doesn't look proper to me in two ways.
>>> - SparkSession is heavier than SparkContext
>>> - According to the following PR description, the background is also
>>> hidden from the community.
>>>
>>> > # Why are the changes needed?
>>> > In databricks runtime, RDD read / write API has some issue for
>>> certain storage types
>>> > that requires the account key, but Dataframe read / write API works.
>>>
>>> In addition, we don't know whether this PR actually fixes the issue with
>>> the unnamed storage type, because it isn't testable within the
>>> community's test coverage.
>>>
>>> I'm wondering whether the Apache Spark community aims to move away from
>>> RDD usage in favor of `Spark Connect`. Isn't it too early, given that
>>> `Spark Connect` is not even GA in the community?
>>>
>>> Dongjoon.
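
P.S. For anyone skimming the thread, the kind of swap under discussion looks roughly like the sketch below: writing and reading a small text payload first through the RDD API on the SparkContext, then through the DataFrame API on an already-active SparkSession. The object and method names are made up for illustration and this is not the actual diff in https://github.com/apache/spark/pull/47328.

import org.apache.spark.sql.SparkSession

object MetadataIoSketch {

  // RDD write path: a single-partition text file written via SparkContext.
  def saveWithRdd(spark: SparkSession, path: String, json: String): Unit =
    spark.sparkContext.parallelize(Seq(json), 1).saveAsTextFile(path)

  // DataFrame write path: the same payload written through the SQL data
  // source layer, reusing the session that is passed in rather than
  // constructing a new one.
  def saveWithDataFrame(spark: SparkSession, path: String, json: String): Unit = {
    import spark.implicits._
    Seq(json).toDF("value").repartition(1).write.text(path)
  }

  // Corresponding read paths.
  def loadWithRdd(spark: SparkSession, path: String): String =
    spark.sparkContext.textFile(path, 1).first()

  def loadWithDataFrame(spark: SparkSession, path: String): String =
    spark.read.text(path).first().getString(0)
}

A caller in this style would typically obtain the session via SparkSession.getActiveSession (or SparkSession.builder().getOrCreate()) rather than creating a fresh one, which, if I read Martin's comment correctly, is the pattern the PR follows.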