My bad, I meant to say that I believe the provided justification is inappropriate.
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:

> So looking at the PR, it does not appear to be removing any RDD APIs, but
> the justification provided for changing the ML backend to use the
> DataFrame APIs is indeed concerning.
>
> This PR appears to have been merged without proper review (or providing
> an opportunity for review).
>
> I’d like to remind people of the expectations we decided on together —
> https://spark.apache.org/committers.html
>
> I believe the provided justification for the change and would ask that we
> revert this PR so that a proper discussion can take place.
>
> “
> In databricks runtime, RDD read / write API has some issue for certain
> storage types that requires the account key, but Dataframe read / write
> API works.
> “
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
> <mar...@databricks.com.invalid> wrote:
>
>> I took a quick look at the PR and would like to understand your concern
>> better about:
>>
>> > SparkSession is heavier than SparkContext
>>
>> It looks like the PR is using the active SparkSession, not creating a
>> new one, etc. I would highly appreciate it if you could help me
>> understand this situation better.
>>
>> Thanks a lot!
>>
>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, All.
>>>
>>> Apache Spark's RDD API has played an essential and invaluable role from
>>> the beginning, and it will continue to do so even if it's not supported
>>> by Spark Connect.
>>>
>>> I have a concern about a recent activity which blindly replaces RDD
>>> usage with SparkSession.
>>>
>>> For instance,
>>>
>>> https://github.com/apache/spark/pull/47328
>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>> Dataframe read / write API
>>>
>>> This PR doesn't look proper to me in two ways.
>>> - SparkSession is heavier than SparkContext
>>> - According to the following PR description, the background is also
>>> hidden from the community.
>>>
>>> > # Why are the changes needed?
>>> > In databricks runtime, RDD read / write API has some issue for
>>> certain storage types
>>> > that requires the account key, but Dataframe read / write API works.
>>>
>>> In addition, we don't know whether this PR actually fixes the issue with
>>> the unnamed storage type, because it isn't testable within the
>>> community's test coverage.
>>>
>>> I'm wondering whether the Apache Spark community aims to move away from
>>> RDD usage in favor of `Spark Connect`. Isn't it too early, given that
>>> `Spark Connect` is not even GA in the community?
>>>
>>> Dongjoon.
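
P.S. For anyone skimming the thread, the kind of swap under discussion looks roughly like the sketch below: writing and reading a small text payload first through the RDD API on the SparkContext, then through the DataFrame API on an already-active SparkSession. The object and method names are made up for illustration and this is not the actual diff in https://github.com/apache/spark/pull/47328.

import org.apache.spark.sql.SparkSession

object MetadataIoSketch {

  // RDD write path: a single-partition text file written via SparkContext.
  def saveWithRdd(spark: SparkSession, path: String, json: String): Unit =
    spark.sparkContext.parallelize(Seq(json), 1).saveAsTextFile(path)

  // DataFrame write path: the same payload written through the SQL data
  // source layer, reusing the session that is passed in rather than
  // constructing a new one.
  def saveWithDataFrame(spark: SparkSession, path: String, json: String): Unit = {
    import spark.implicits._
    Seq(json).toDF("value").repartition(1).write.text(path)
  }

  // Corresponding read paths.
  def loadWithRdd(spark: SparkSession, path: String): String =
    spark.sparkContext.textFile(path, 1).first()

  def loadWithDataFrame(spark: SparkSession, path: String): String =
    spark.read.text(path).first().getString(0)
}

A caller in this style would typically obtain the session via SparkSession.getActiveSession (or SparkSession.builder().getOrCreate()) rather than creating a fresh one, which, if I read Martin's comment correctly, is the pattern the PR follows.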