Yeah, that's fine. I'll revert and open a fresh PR, including my own follow-up, when I get back home later today.
On Sat, Jul 13, 2024 at 3:08 PM Holden Karau <holden.ka...@gmail.com> wrote:

> Even if the change is reasonable (and I can see arguments both ways), it's
> important that we follow the process we agreed on. Merging a PR without
> discussion* in ~2 hours from the initial proposal is not enough time to
> reach a lazy consensus. If it were a small bug fix I could understand, but
> this was a non-trivial change.
>
> * It was approved by another committer, but without any discussion, and
> the approver & code author work for the same employer mentioned as the
> justification for the change.
>
> On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>
>> I think we should not have mentioned a specific vendor there. The change
>> also shouldn't repartition; we should create the data with a single
>> partition in the first place.
>>
>> But in general, leveraging the Catalyst optimizer and SQL engine there
>> is a good idea, as we can leverage all the optimizations there. For
>> example, it will use UTF8 encoding instead of a plain string ser/de. We
>> made similar changes in JSON and CSV schema inference (it used an RDD
>> before).
>>
>> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com>
>> wrote:
>>
>>> My bad, I meant to say I believe the provided justification is
>>> inappropriate.
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com>
>>> wrote:
>>>
>>>> So looking at the PR, it does not appear to be removing any RDD APIs,
>>>> but the justification provided for changing the ML backend to use the
>>>> DataFrame APIs is indeed concerning.
>>>>
>>>> This PR appears to have been merged without proper review (or
>>>> providing an opportunity for review).
>>>>
>>>> I'd like to remind people of the expectations we decided on together:
>>>> https://spark.apache.org/committers.html
>>>>
>>>> I believe the provided justification for the change and would ask that
>>>> we revert this PR so that a proper discussion can take place.
>>>>
>>>> "
>>>> In databricks runtime, RDD read / write API has some issue for certain
>>>> storage types that requires the account key, but Dataframe read /
>>>> write API works.
>>>> "
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>>
>>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
>>>> <mar...@databricks.com.invalid> wrote:
>>>>
>>>>> I took a quick look at the PR and would like to understand your
>>>>> concern better about:
>>>>>
>>>>> > SparkSession is heavier than SparkContext
>>>>>
>>>>> It looks like the PR is using the active SparkSession, not creating a
>>>>> new one, etc. I would highly appreciate it if you could help me
>>>>> understand this situation better.
>>>>>
>>>>> Thanks a lot!
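For readers following along: "using the active SparkSession" refers to reusing the session the application has already started rather than building a second one. A minimal sketch of that pattern in Scala (names here are illustrative, not the PR's actual code):

    import org.apache.spark.sql.SparkSession

    object ActiveSessionSketch {
      def main(args: Array[String]): Unit = {
        // The application (or the ML backend's caller) already owns a session.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("active-session-sketch")
          .getOrCreate()

        // Library code can pick up that same session instead of creating one:
        val active: SparkSession = SparkSession.active
        // or, defensively, as an Option:
        val maybeActive: Option[SparkSession] = SparkSession.getActiveSession

        // The same instance is reused, so no extra session (or context) is built.
        assert(active eq spark)
        spark.stop()
      }
    }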
>>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> Apache Spark's RDD API has played an essential and invaluable role
>>>>>> from the beginning, and it will continue to do so even if it's not
>>>>>> supported by Spark Connect.
>>>>>>
>>>>>> I have a concern about recent activity that blindly replaces RDD
>>>>>> with SparkSession.
>>>>>>
>>>>>> For instance,
>>>>>>
>>>>>> https://github.com/apache/spark/pull/47328
>>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>>>> Dataframe read / write API
>>>>>>
>>>>>> This PR doesn't look proper to me in two ways.
>>>>>> - SparkSession is heavier than SparkContext
>>>>>> - According to the following PR description, the background is also
>>>>>> hidden from the community.
>>>>>>
>>>>>> > # Why are the changes needed?
>>>>>> > In databricks runtime, RDD read / write API has some issue for
>>>>>> certain storage types
>>>>>> > that requires the account key, but Dataframe read / write API
>>>>>> works.
>>>>>>
>>>>>> In addition, we don't know whether this PR actually fixes the
>>>>>> mentioned storage issue, because it's not testable within the
>>>>>> community's test coverage.
>>>>>>
>>>>>> I'm wondering whether the Apache Spark community aims to move away
>>>>>> from RDD usage in favor of `Spark Connect`. Isn't it too early,
>>>>>> given that `Spark Connect` is not even GA in the community?
>>>>>>
>>>>>> Dongjoon.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
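P.S. For anyone skimming the thread: the change under discussion swaps the RDD-based text read / write for the DataFrame equivalent when persisting small ML metadata. A rough sketch of the two styles, reflecting Hyukjin's point about creating the data as a single partition rather than repartitioning (paths and names are made up for illustration; the PR's actual code may differ):

    import org.apache.spark.sql.SparkSession

    object MetadataPersistSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("metadata-persist-sketch")
          .getOrCreate()
        val sc = spark.sparkContext
        import spark.implicits._

        val metadataJson = """{"class":"org.example.MyModel","version":"1"}"""

        // Old style: RDD read / write API via SparkContext.
        // numSlices = 1 so the metadata lands in a single part file.
        sc.parallelize(Seq(metadataJson), numSlices = 1)
          .saveAsTextFile("/tmp/model-rdd/metadata")
        val viaRdd = sc.textFile("/tmp/model-rdd/metadata").first()

        // New style: DataFrame read / write API via SparkSession, which
        // goes through Catalyst and the SQL engine. The dataset is created
        // as one partition up front instead of being repartitioned later.
        val ds = spark.createDataset(
          sc.parallelize(Seq(metadataJson), numSlices = 1))
        ds.write.text("/tmp/model-df/metadata")
        val viaDf = spark.read.text("/tmp/model-df/metadata").head().getString(0)

        // Both round-trips recover the same metadata string.
        assert(viaRdd == viaDf)
        spark.stop()
      }
    }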