I think we should not have mentioned a specific vendor there. The change also shouldn't repartition; we should just create a single partition.
But in general, leveraging the Catalyst optimizer and SQL engine there is a good idea, as we can take advantage of all the optimizations there. For example, it will use UTF8 encoding instead of plain string ser/de. We made similar changes in JSON and CSV schema inference (it was RDD-based before).

On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com> wrote:

> My bad, I meant to say I believe the provided justification is
> inappropriate.
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> So looking at the PR, it does not appear to be removing any RDD APIs, but
>> the justification provided for changing the ML backend to use the
>> DataFrame APIs is indeed concerning.
>>
>> This PR appears to have been merged without proper review (or providing
>> an opportunity for review).
>>
>> I'd like to remind people of the expectations we decided on together:
>> https://spark.apache.org/committers.html
>>
>> I believe the provided justification for the change and would ask that we
>> revert this PR so that a proper discussion can take place.
>>
>> "
>> In databricks runtime, RDD read / write API has some issue for certain
>> storage types that requires the account key, but Dataframe read / write
>> API works.
>> "
>>
>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
>> <mar...@databricks.com.invalid> wrote:
>>
>>> I took a quick look at the PR and would like to understand your concern
>>> better about:
>>>
>>> > SparkSession is heavier than SparkContext
>>>
>>> It looks like the PR is using the active SparkSession, not creating a
>>> new one, etc. I would highly appreciate it if you could help me
>>> understand this situation better.
>>>
>>> Thanks a lot!
>>>
>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> Apache Spark's RDD API has played an essential and invaluable role from
>>>> the beginning, and it will continue to do so even if it's not supported
>>>> by Spark Connect.
>>>>
>>>> I have a concern about a recent activity which replaces RDD with
>>>> SparkSession blindly.
>>>>
>>>> For instance,
>>>>
>>>> https://github.com/apache/spark/pull/47328
>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>> Dataframe read / write API
>>>>
>>>> This PR doesn't look proper to me in two ways.
>>>> - SparkSession is heavier than SparkContext
>>>> - According to the following PR description, the background is also
>>>> hidden from the community.
>>>>
>>>> > # Why are the changes needed?
>>>> > In databricks runtime, RDD read / write API has some issue for
>>>> > certain storage types that requires the account key, but Dataframe
>>>> > read / write API works.
>>>>
>>>> In addition, we don't know whether this PR fixes the mentioned unknown
>>>> storage's issue or not, because it's not testable in the community test
>>>> coverage.
>>>>
>>>> I'm wondering if the Apache Spark community aims to move away from RDD
>>>> usage in favor of `Spark Connect`.
>>>> Isn't it too early, because `Spark Connect` is not even GA in the
>>>> community?
>>>>
>>>> Dongjoon.
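[For readers following the encoding point at the top of the thread, here is a tiny illustrative sketch. It is an analogy only, not Spark's actual code path: plain Python, with `pickle` standing in for generic JVM object/string ser/de, to show why a compact UTF-8 byte representation tends to be cheaper than routing a string through a general-purpose serializer.]

```python
# Analogy only (not Spark internals): a generic serializer adds protocol
# framing and metadata around the value, while a raw UTF-8 encoding carries
# only the character bytes themselves.
import pickle

value = "Apache Spark schema inference"

generic = pickle.dumps(value)   # generic ser/de: payload plus framing overhead
utf8 = value.encode("utf-8")    # compact encoding: just the encoded characters

print(len(utf8), len(generic))
assert len(utf8) < len(generic)  # the UTF-8 payload is strictly smaller here
```

[The gap grows further in practice once object headers and per-record serializer setup are counted, which is part of why engine-internal binary representations are preferred over generic serialization.]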