I think we should not have mentioned a specific vendor there. The change also shouldn't repartition; we should just create a single partition.
But in general, leveraging the Catalyst optimizer and SQL engine there is a good idea, as we can take advantage of all the optimizations there. For example, it will use UTF8 encoding instead of plain string ser/de. We made similar changes in JSON and CSV schema inference (it was RDD-based before).

On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com> wrote:

> My bad, I meant to say I believe the provided justification is
> inappropriate.
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> So looking at the PR, it does not appear to be removing any RDD APIs, but
>> the justification provided for changing the ML backend to use the
>> DataFrame APIs is indeed concerning.
>>
>> This PR appears to have been merged without proper review (or providing
>> an opportunity for review).
>>
>> I'd like to remind people of the expectations we decided on together:
>> https://spark.apache.org/committers.html
>>
>> I believe the provided justification for the change and would ask that we
>> revert this PR so that a proper discussion can take place.
>>
>> "
>> In databricks runtime, RDD read / write API has some issue for certain
>> storage types that requires the account key, but Dataframe read / write
>> API works.
>> "
>>
>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
>> <mar...@databricks.com.invalid> wrote:
>>
>>> I took a quick look at the PR and would like to understand your concern
>>> better about:
>>>
>>> > SparkSession is heavier than SparkContext
>>>
>>> It looks like the PR is using the active SparkSession, not creating a
>>> new one, etc. I would highly appreciate it if you could help me
>>> understand this situation better.
>>>
>>> Thanks a lot!
>>>
>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> Apache Spark's RDD API has played an essential and invaluable role from
>>>> the beginning, and it will continue to do so even if it's not supported
>>>> by Spark Connect.
>>>>
>>>> I have a concern about a recent activity which replaces RDD with
>>>> SparkSession blindly.
>>>>
>>>> For instance,
>>>>
>>>> https://github.com/apache/spark/pull/47328
>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>> Dataframe read / write API
>>>>
>>>> This PR doesn't look proper to me in two ways.
>>>> - SparkSession is heavier than SparkContext
>>>> - According to the following PR description, the background is also
>>>> hidden from the community.
>>>>
>>>> > # Why are the changes needed?
>>>> > In databricks runtime, RDD read / write API has some issue for
>>>> > certain storage types that requires the account key, but Dataframe
>>>> > read / write API works.
>>>>
>>>> In addition, we don't know whether this PR fixes the mentioned unknown
>>>> storage's issue or not, because it's not testable in the community test
>>>> coverage.
>>>>
>>>> I'm wondering if the Apache Spark community aims to move away from RDD
>>>> usage in favor of `Spark Connect`.
>>>> Isn't it too early, because `Spark Connect` is not even GA in the
>>>> community?
>>>>
>>>> Dongjoon.
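[For readers following the encoding point at the top of the thread, here is a tiny illustrative sketch. It is an analogy only, not Spark's actual code path: plain Python, with `pickle` standing in for generic JVM object/string ser/de, to show why a compact UTF-8 byte representation tends to be cheaper than routing a string through a general-purpose serializer.]

```python
# Analogy only (not Spark internals): a generic serializer adds protocol
# framing and metadata around the value, while a raw UTF-8 encoding carries
# only the character bytes themselves.
import pickle

value = "Apache Spark schema inference"

generic = pickle.dumps(value)   # generic ser/de: payload plus framing overhead
utf8 = value.encode("utf-8")    # compact encoding: just the encoded characters

print(len(utf8), len(generic))
assert len(utf8) < len(generic)  # the UTF-8 payload is strictly smaller here
```

[The gap grows further in practice once object headers and per-record serializer setup are counted, which is part of why engine-internal binary representations are preferred over generic serialization.]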