👍

On Sun, Jul 14, 2024 at 1:07 AM Holden Karau <holden.ka...@gmail.com> wrote:
> Thank you :)
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> On Sat, Jul 13, 2024 at 1:37 AM Hyukjin Kwon <gurwls...@apache.org> wrote:
>
>> Reverted, and opened a new one: https://github.com/apache/spark/pull/47341.
>>
>> On Sat, 13 Jul 2024 at 15:40, Hyukjin Kwon <gurwls...@apache.org> wrote:
>>
>>> Yeah, that's fine. I'll revert and open a fresh PR, including my own
>>> follow-up, when I get back home later today.
>>>
>>> On Sat, Jul 13, 2024 at 3:08 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> Even if the change is reasonable (and I can see arguments both ways),
>>>> it's important that we follow the process we agreed on. Merging a PR
>>>> without discussion* roughly two hours after the initial proposal is not
>>>> enough time to reach a lazy consensus. If it were a small bug fix I
>>>> could understand, but this was a non-trivial change.
>>>>
>>>> * It was approved by another committer, but without any discussion, and
>>>> the approver and code author work for the same employer that was
>>>> mentioned as the justification for the change.
>>>>
>>>> On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>>
>>>>> I think we should not have mentioned a specific vendor there. The
>>>>> change also shouldn't repartition; we should create the DataFrame with
>>>>> a single partition instead.
>>>>>
>>>>> In general, though, leveraging the Catalyst optimizer and SQL engine
>>>>> there is a good idea, as we can take advantage of all the
>>>>> optimizations there. For example, it will use UTF-8 encoding instead
>>>>> of plain string ser/de. We made similar changes in JSON and CSV schema
>>>>> inference (it was an RDD before).
>>>>>
>>>>> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>
>>>>>> My bad, I meant to say: I believe the provided justification is
>>>>>> inappropriate.
>>>>>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> Looking at the PR, it does not appear to remove any RDD APIs, but
>>>>>>> the justification provided for changing the ML backend to use the
>>>>>>> DataFrame APIs is indeed concerning.
>>>>>>>
>>>>>>> This PR appears to have been merged without proper review (or
>>>>>>> without providing an opportunity for review).
>>>>>>>
>>>>>>> I'd like to remind people of the expectations we decided on
>>>>>>> together: https://spark.apache.org/committers.html
>>>>>>>
>>>>>>> I believe the provided justification for the change and would ask
>>>>>>> that we revert this PR so that a proper discussion can take place.
>>>>>>>
>>>>>>> "In databricks runtime, RDD read / write API has some issue for
>>>>>>> certain storage types that requires the account key, but Dataframe
>>>>>>> read / write API works."
>>>>>>>
>>>>>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
>>>>>>> <mar...@databricks.com.invalid> wrote:
>>>>>>>
>>>>>>>> I took a quick look at the PR and would like to understand your
>>>>>>>> concern better about:
>>>>>>>>
>>>>>>>> > SparkSession is heavier than SparkContext
>>>>>>>>
>>>>>>>> It looks like the PR is using the active SparkSession, not creating
>>>>>>>> a new one. I would highly appreciate it if you could help me
>>>>>>>> understand this situation better.
>>>>>>>>
>>>>>>>> Thanks a lot!
>>>>>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, All.
>>>>>>>>>
>>>>>>>>> Apache Spark's RDD API has played an essential and invaluable role
>>>>>>>>> from the beginning, and it will continue to do so even if it's not
>>>>>>>>> supported by Spark Connect.
>>>>>>>>>
>>>>>>>>> I have a concern about recent activity that blindly replaces RDD
>>>>>>>>> usage with SparkSession.
>>>>>>>>>
>>>>>>>>> For instance:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/pull/47328
>>>>>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>>>>>>> Dataframe read / write API
>>>>>>>>>
>>>>>>>>> This PR doesn't look proper to me, in two ways:
>>>>>>>>> - SparkSession is heavier than SparkContext.
>>>>>>>>> - According to the following PR description, the background is
>>>>>>>>> also hidden from the community.
>>>>>>>>>
>>>>>>>>> > # Why are the changes needed?
>>>>>>>>> > In databricks runtime, RDD read / write API has some issue for
>>>>>>>>> > certain storage types that requires the account key, but
>>>>>>>>> > Dataframe read / write API works.
>>>>>>>>>
>>>>>>>>> In addition, we don't know whether this PR actually fixes the
>>>>>>>>> mentioned unknown storage's issue, because it's not testable in
>>>>>>>>> the community test coverage.
>>>>>>>>>
>>>>>>>>> I'm wondering if the Apache Spark community aims to move away from
>>>>>>>>> RDD usage in favor of `Spark Connect`. Isn't it too early, given
>>>>>>>>> that `Spark Connect` is not even GA in the community?
>>>>>>>>>
>>>>>>>>> Dongjoon.