Even if the change is reasonable (and I can see arguments both ways), it's important that we follow the process we agreed on. Merging a PR without discussion*, ~2 hours after the initial proposal, does not leave enough time to reach a lazy consensus. If it were a small bug fix I could understand, but this was a non-trivial change.
* It was approved by another committer, but without any discussion, and the approver and code author work for the same employer that is mentioned as the justification for the change.

On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:

> I think we should not have mentioned a specific vendor there. The change
> also shouldn't repartition; we should create the data with a single
> partition.
>
> But in general, leveraging the Catalyst optimizer and SQL engine there is
> a good idea, as we can leverage all the optimizations there. For example,
> it will use UTF8 encoding instead of plain string ser/de. We made similar
> changes in JSON and CSV schema inference (it was an RDD before).
>
> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com>
> wrote:
>
>> My bad, I meant to say I believe the provided justification is
>> inappropriate.
>>
>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com>
>> wrote:
>>
>>> Looking at the PR, it does not appear to be removing any RDD APIs, but
>>> the justification provided for changing the ML backend to use the
>>> DataFrame APIs is indeed concerning.
>>>
>>> This PR appears to have been merged without proper review (or providing
>>> an opportunity for review).
>>>
>>> I'd like to remind people of the expectations we decided on together:
>>> https://spark.apache.org/committers.html
>>>
>>> I believe the provided justification for the change and would ask that
>>> we revert this PR so that a proper discussion can take place.
>>>
>>> "In databricks runtime, RDD read / write API has some issue for certain
>>> storage types that requires the account key, but Dataframe read / write
>>> API works."
>>>
>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
>>> <mar...@databricks.com.invalid> wrote:
>>>
>>>> I took a quick look at the PR and would like to understand your
>>>> concern better about:
>>>>
>>>> > SparkSession is heavier than SparkContext
>>>>
>>>> It looks like the PR is using the active SparkSession, not creating a
>>>> new one, etc. I would highly appreciate it if you could help me
>>>> understand this situation better.
>>>>
>>>> Thanks a lot!
>>>>
>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> Apache Spark's RDD API has played an essential and invaluable role
>>>>> from the beginning, and it will continue to do so even if it's not
>>>>> supported by Spark Connect.
>>>>>
>>>>> I have a concern about recent activity that blindly replaces RDD
>>>>> usage with SparkSession.
>>>>>
>>>>> For instance,
>>>>>
>>>>> https://github.com/apache/spark/pull/47328
>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>>> Dataframe read / write API
>>>>>
>>>>> This PR doesn't look proper to me, for two reasons:
>>>>> - SparkSession is heavier than SparkContext.
>>>>> - According to the following PR description, the background is also
>>>>> hidden from the community.
>>>>>
>>>>> > # Why are the changes needed?
>>>>> > In databricks runtime, RDD read / write API has some issue for
>>>>> > certain storage types that requires the account key, but Dataframe
>>>>> > read / write API works.
>>>>>
>>>>> In addition, we don't know whether this PR fixes the mentioned
>>>>> unknown storage's issue, because it isn't testable within the
>>>>> community's test coverage.
>>>>>
>>>>> I'm wondering if the Apache Spark community aims to move away from
>>>>> RDD usage in favor of `Spark Connect`. Isn't it too early, given that
>>>>> `Spark Connect` is not even GA in the community?
>>>>>
>>>>> Dongjoon.

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
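
For readers following the technical point in Hyukjin's reply, here is a minimal Scala sketch of the kind of rewrite being discussed: replacing an RDD text write with the DataFrame writer while keeping the data in a single partition. This is one reading of the thread, not the PR's actual code; the payload and output path are hypothetical, and coalesce(1) is used because it keeps one partition without the shuffle that repartition(1) would trigger.

import org.apache.spark.sql.SparkSession

object SinglePartitionWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical one-line JSON payload standing in for ML model metadata.
    val metadataJson = """{"class":"org.example.MyModel","sparkVersion":"4.0.0"}"""

    // Before (RDD writer):
    //   spark.sparkContext.parallelize(Seq(metadataJson), 1).saveAsTextFile(path)
    // After (DataFrame writer): coalesce(1) keeps a single partition without
    // a shuffle, so exactly one text file is produced, and the write goes
    // through the SQL engine's code path.
    spark.createDataset(Seq(metadataJson))
      .coalesce(1)
      .write
      .text("/tmp/model-metadata") // hypothetical output path

    spark.stop()
  }
}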
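On Martin's point about the active SparkSession and Dongjoon's "SparkSession is heavier than SparkContext" concern, a minimal spark-shell-style sketch of the reuse pattern as I read it (again, not the PR's literal code): when a session is already running, no new session or second SparkContext is constructed.

import org.apache.spark.sql.SparkSession

// Reuse the session already active on this thread when one exists;
// getOrCreate() falls back to the default session and only builds a fresh
// one (with its SparkContext) when none is available.
val spark: SparkSession =
  SparkSession.getActiveSession.getOrElse(SparkSession.builder().getOrCreate())

// The underlying SparkContext remains reachable from the session.
val sc = spark.sparkContext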