👍

On Sun, Jul 14, 2024 at 1:07 AM Holden Karau <holden.ka...@gmail.com> wrote:
> Thank you :)
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> On Sat, Jul 13, 2024 at 1:37 AM Hyukjin Kwon <gurwls...@apache.org> wrote:
>
>> Reverted, and opened a new one: https://github.com/apache/spark/pull/47341.
>>
>> On Sat, 13 Jul 2024 at 15:40, Hyukjin Kwon <gurwls...@apache.org> wrote:
>>
>>> Yeah, that's fine. I'll revert and open a fresh PR, including my own
>>> follow-up, when I get back home later today.
>>>
>>> On Sat, Jul 13, 2024 at 3:08 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> Even if the change is reasonable (and I can see arguments both ways),
>>>> it's important that we follow the process we agreed on. Merging a PR
>>>> without discussion* roughly two hours after the initial proposal is not
>>>> enough time to reach a lazy consensus. If it were a small bug fix I
>>>> could understand, but this was a non-trivial change.
>>>>
>>>> * It was approved by another committer, but without any discussion, and
>>>> the approver and code author work for the same employer that was
>>>> mentioned as the justification for the change.
>>>>
>>>> On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>>
>>>>> I think we should not have mentioned a specific vendor there. The
>>>>> change also shouldn't repartition; we should create the DataFrame with
>>>>> a single partition instead.
>>>>>
>>>>> In general, though, leveraging the Catalyst optimizer and SQL engine
>>>>> there is a good idea, as we can take advantage of all the
>>>>> optimizations there. For example, it will use UTF-8 encoding instead
>>>>> of plain string ser/de. We made similar changes in JSON and CSV schema
>>>>> inference (it was an RDD before).
>>>>>
>>>>> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>
>>>>>> My bad, I meant to say: I believe the provided justification is
>>>>>> inappropriate.
>>>>>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> Looking at the PR, it does not appear to remove any RDD APIs, but
>>>>>>> the justification provided for changing the ML backend to use the
>>>>>>> DataFrame APIs is indeed concerning.
>>>>>>>
>>>>>>> This PR appears to have been merged without proper review (or
>>>>>>> without providing an opportunity for review).
>>>>>>>
>>>>>>> I'd like to remind people of the expectations we decided on
>>>>>>> together: https://spark.apache.org/committers.html
>>>>>>>
>>>>>>> I believe the provided justification for the change and would ask
>>>>>>> that we revert this PR so that a proper discussion can take place.
>>>>>>>
>>>>>>> "In databricks runtime, RDD read / write API has some issue for
>>>>>>> certain storage types that requires the account key, but Dataframe
>>>>>>> read / write API works."
>>>>>>>
>>>>>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
>>>>>>> <mar...@databricks.com.invalid> wrote:
>>>>>>>
>>>>>>>> I took a quick look at the PR and would like to understand your
>>>>>>>> concern better about:
>>>>>>>>
>>>>>>>> > SparkSession is heavier than SparkContext
>>>>>>>>
>>>>>>>> It looks like the PR is using the active SparkSession, not creating
>>>>>>>> a new one. I would highly appreciate it if you could help me
>>>>>>>> understand this situation better.
>>>>>>>>
>>>>>>>> Thanks a lot!
>>>>>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, All.
>>>>>>>>>
>>>>>>>>> Apache Spark's RDD API has played an essential and invaluable role
>>>>>>>>> from the beginning, and it will continue to do so even if it's not
>>>>>>>>> supported by Spark Connect.
>>>>>>>>>
>>>>>>>>> I have a concern about recent activity that blindly replaces RDD
>>>>>>>>> usage with SparkSession.
>>>>>>>>>
>>>>>>>>> For instance:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/pull/47328
>>>>>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>>>>>>> Dataframe read / write API
>>>>>>>>>
>>>>>>>>> This PR doesn't look proper to me, in two ways:
>>>>>>>>> - SparkSession is heavier than SparkContext.
>>>>>>>>> - According to the following PR description, the background is
>>>>>>>>> also hidden from the community.
>>>>>>>>>
>>>>>>>>> > # Why are the changes needed?
>>>>>>>>> > In databricks runtime, RDD read / write API has some issue for
>>>>>>>>> > certain storage types that requires the account key, but
>>>>>>>>> > Dataframe read / write API works.
>>>>>>>>>
>>>>>>>>> In addition, we don't know whether this PR actually fixes the
>>>>>>>>> mentioned unknown storage's issue, because it's not testable in
>>>>>>>>> the community test coverage.
>>>>>>>>>
>>>>>>>>> I'm wondering if the Apache Spark community aims to move away from
>>>>>>>>> RDD usage in favor of `Spark Connect`. Isn't it too early, given
>>>>>>>>> that `Spark Connect` is not even GA in the community?
>>>>>>>>>
>>>>>>>>> Dongjoon.