Even if the change is reasonable (and I can see arguments both ways), it's important that we follow the process we agreed on. Merging a PR without discussion*, ~2 hours after the initial proposal, does not leave enough time to reach a lazy consensus. If it were a small bug fix I could understand, but this was a non-trivial change.
* It was approved by another committer, but without any discussion, and the approver and code author work for the same employer that is mentioned as the justification for the change.

On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:

> I think we should not have mentioned a specific vendor there. The change
> also shouldn't repartition; we should create the data with a single
> partition.
>
> But in general, leveraging the Catalyst optimizer and SQL engine there is
> a good idea, as we can leverage all the optimizations there. For example,
> it will use UTF8 encoding instead of plain string ser/de. We made similar
> changes in JSON and CSV schema inference (it was an RDD before).
>
> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com>
> wrote:
>
>> My bad, I meant to say I believe the provided justification is
>> inappropriate.
>>
>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com>
>> wrote:
>>
>>> Looking at the PR, it does not appear to be removing any RDD APIs, but
>>> the justification provided for changing the ML backend to use the
>>> DataFrame APIs is indeed concerning.
>>>
>>> This PR appears to have been merged without proper review (or providing
>>> an opportunity for review).
>>>
>>> I'd like to remind people of the expectations we decided on together:
>>> https://spark.apache.org/committers.html
>>>
>>> I believe the provided justification for the change and would ask that
>>> we revert this PR so that a proper discussion can take place.
>>>
>>> "In databricks runtime, RDD read / write API has some issue for certain
>>> storage types that requires the account key, but Dataframe read / write
>>> API works."
>>>
>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
>>> <mar...@databricks.com.invalid> wrote:
>>>
>>>> I took a quick look at the PR and would like to understand your
>>>> concern better about:
>>>>
>>>> > SparkSession is heavier than SparkContext
>>>>
>>>> It looks like the PR is using the active SparkSession, not creating a
>>>> new one, etc. I would highly appreciate it if you could help me
>>>> understand this situation better.
>>>>
>>>> Thanks a lot!
>>>>
>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> Apache Spark's RDD API has played an essential and invaluable role
>>>>> from the beginning, and it will continue to do so even if it's not
>>>>> supported by Spark Connect.
>>>>>
>>>>> I have a concern about recent activity that blindly replaces RDD
>>>>> usage with SparkSession.
>>>>>
>>>>> For instance,
>>>>>
>>>>> https://github.com/apache/spark/pull/47328
>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>>> Dataframe read / write API
>>>>>
>>>>> This PR doesn't look proper to me, for two reasons:
>>>>> - SparkSession is heavier than SparkContext.
>>>>> - According to the following PR description, the background is also
>>>>> hidden from the community.
>>>>>
>>>>> > # Why are the changes needed?
>>>>> > In databricks runtime, RDD read / write API has some issue for
>>>>> > certain storage types that requires the account key, but Dataframe
>>>>> > read / write API works.
>>>>>
>>>>> In addition, we don't know whether this PR fixes the mentioned
>>>>> unknown storage's issue, because it isn't testable within the
>>>>> community's test coverage.
>>>>>
>>>>> I'm wondering if the Apache Spark community aims to move away from
>>>>> RDD usage in favor of `Spark Connect`. Isn't it too early, given that
>>>>> `Spark Connect` is not even GA in the community?
>>>>>
>>>>> Dongjoon.

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
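
For readers following the technical point in Hyukjin's reply, here is a minimal Scala sketch of the kind of rewrite being discussed: replacing an RDD text write with the DataFrame writer while keeping the data in a single partition. This is one reading of the thread, not the PR's actual code; the payload and output path are hypothetical, and coalesce(1) is used because it keeps one partition without the shuffle that repartition(1) would trigger.

import org.apache.spark.sql.SparkSession

object SinglePartitionWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical one-line JSON payload standing in for ML model metadata.
    val metadataJson = """{"class":"org.example.MyModel","sparkVersion":"4.0.0"}"""

    // Before (RDD writer):
    //   spark.sparkContext.parallelize(Seq(metadataJson), 1).saveAsTextFile(path)
    // After (DataFrame writer): coalesce(1) keeps a single partition without
    // a shuffle, so exactly one text file is produced, and the write goes
    // through the SQL engine's code path.
    spark.createDataset(Seq(metadataJson))
      .coalesce(1)
      .write
      .text("/tmp/model-metadata") // hypothetical output path

    spark.stop()
  }
}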
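On Martin's point about the active SparkSession and Dongjoon's "SparkSession is heavier than SparkContext" concern, a minimal spark-shell-style sketch of the reuse pattern as I read it (again, not the PR's literal code): when a session is already running, no new session or second SparkContext is constructed.

import org.apache.spark.sql.SparkSession

// Reuse the session already active on this thread when one exists;
// getOrCreate() falls back to the default session and only builds a fresh
// one (with its SparkContext) when none is available.
val spark: SparkSession =
  SparkSession.getActiveSession.getOrElse(SparkSession.builder().getOrCreate())

// The underlying SparkContext remains reachable from the session.
val sc = spark.sparkContext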