+1 (non-binding) on this proposal. As long as there are no schedule concerns (echoing Mridul's and Dongjoon's call-outs), I think this would be helpful for adoption. Thanks!
On Tue, Feb 4, 2025 at 18:43 huaxin gao <huaxin.ga...@gmail.com> wrote:

> I support publishing an additional Spark distribution with Spark Connect
> enabled in Spark 4.0 to boost Spark adoption. I also share Dongjoon's
> concern regarding potential schedule delays. As long as we monitor the
> timeline closely and thoroughly document any PRs that do not make it into
> the RC, we should be in good shape.
>
> I am casting my +1 on this proposal.
>
> On Tue, Feb 4, 2025 at 6:10 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>
>> +1 to new distribution mechanisms that will increase Spark adoption!
>>
>> I do agree with Dongjoon's concern that this should not result in
>> slipping the schedule; that is something to watch out for.
>>
>> Regards,
>> Mridul
>>
>> On Tue, Feb 4, 2025 at 8:07 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>
>>> I am fine with providing another option (+1), leaving the others as they
>>> are. Once the vote passes, we should probably make it ready ASAP; I don't
>>> think it will need many changes in any event.
>>>
>>> On Wed, 5 Feb 2025 at 02:40, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>>> Many of the remaining PRs relate to Spark ML Connect support, but they
>>>> are not critical blockers for offering an additional Spark distribution
>>>> with Spark Connect enabled by default in Spark 4.0, which would let
>>>> users try it out and provide more feedback.
>>>>
>>>> I agree that we should not postpone the Spark 4.0 release. If these PRs
>>>> do not land before the RC cut, we should ensure they are properly
>>>> documented.
>>>>
>>>> Thanks,
>>>>
>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>
>>>> On Feb 4, 2025, at 7:23 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>> Many new-feature `Connect` patches are still landing in `branch-4.0`
>>>> during the QA period, after February 1st:
>>>>
>>>> SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect Scala Client
>>>> SPARK-50104 Support SparkSession.executeCommand in Connect
>>>> SPARK-50943 Support `Correlation` on Connect
>>>> SPARK-50133 Support DataFrame conversion to table argument in Spark Connect Python Client
>>>> SPARK-50942 Support `ChiSquareTest` on Connect
>>>> SPARK-50899 Support PrefixSpan on connect
>>>> SPARK-51060 Support `QuantileDiscretizer` on Connect
>>>> SPARK-50974 Add support foldCol for CrossValidator on connect
>>>> SPARK-51015 Support RFormulaModel.toString on Connect
>>>> SPARK-50843 Support return a new model from existing one
>>>>
>>>> AFAIK, the only thing the community can agree on is that `Connect`
>>>> development is not finished yet.
>>>> - Since `Connect` development is not finished, more patches will land
>>>> if we want it to be complete.
>>>> - Since `Connect` development is not finished, there are more concerns
>>>> about adding this as a new distribution.
>>>>
>>>> That's why I asked about the release schedule only. To deliver the
>>>> newly proposed distribution meaningfully and completely in Spark 4.0,
>>>> we need to consider not only your new patch but also the remaining
>>>> `Connect` PRs.
>>>>
>>>> So, let me ask you again: are you sure that there will be no delay?
>>>> Looking at the commit history, I'm wondering whether both Herman and
>>>> Ruifeng agree with you.
>>>>
>>>> To be clear, if there is no harm to the Apache Spark community, I'll
>>>> give my +1, of course. Why not?
>>>>
>>>> Thanks,
>>>> Dongjoon.
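(For context on the ML-on-Connect tickets listed above: once patches like SPARK-50943 land, existing pyspark.ml code is expected to run unchanged against a Connect session. A minimal sketch, assuming a Spark 4.0 build that includes these patches and a Connect server running on the default port:)

```python
# A minimal sketch, assuming a Spark 4.0 build with the ML-on-Connect
# patches above and a Spark Connect server on localhost:15002.
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

data = [(Vectors.dense([1.0, 0.0, 3.0]),),
        (Vectors.dense([2.0, 5.0, 1.0]),),
        (Vectors.dense([4.0, 2.0, 8.0]),)]
df = spark.createDataFrame(data, ["features"])

# The same API as classic PySpark; with SPARK-50943 the computation is
# expected to run over Connect instead of requiring a local JVM.
row = Correlation.corr(df, "features").head()
print(row[0])  # Pearson correlation matrix
```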
>>>> On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> Hi Dongjoon,
>>>>>
>>>>> This is a big decision but not a big project. We just need to update
>>>>> the release scripts to produce the additional Spark distribution. If
>>>>> people are positive about this, I can start implementing the script
>>>>> changes now and merge them after this proposal has been voted on and
>>>>> approved.
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Wenchen.
>>>>>>
>>>>>> I'm wondering whether this implies any delay to the existing QA and
>>>>>> RC1 schedule.
>>>>>>
>>>>>> If so, why don't we schedule this new alternative proposal for Spark
>>>>>> 4.1 properly?
>>>>>>
>>>>>> Best regards,
>>>>>> Dongjoon
>>>>>>
>>>>>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> There is partial agreement and consensus that Spark Connect is
>>>>>>> crucial for the future stability of Spark APIs, for both end users
>>>>>>> and developers. At the same time, a couple of PMC members raised
>>>>>>> concerns about making Spark Connect the default in the upcoming
>>>>>>> Spark 4.0 release. I'm proposing an alternative approach here:
>>>>>>> publish an additional Spark distribution with Spark Connect enabled
>>>>>>> by default. This approach will help promote the adoption of Spark
>>>>>>> Connect among new users while allowing us to gather valuable
>>>>>>> feedback. A separate distribution with Spark Connect enabled by
>>>>>>> default can also promote future adoption of Spark Connect for
>>>>>>> languages like Rust, Go, or Scala 3.
>>>>>>>
>>>>>>> Here are the details of the proposal:
>>>>>>>
>>>>>>> - Spark 4.0 will include three PyPI packages:
>>>>>>>   - pyspark: the classic package.
>>>>>>>   - pyspark-client: the thin Spark Connect Python client. Note that
>>>>>>>     in the Spark 4.0 preview releases we published the
>>>>>>>     pyspark-connect package for the thin client; we will need to
>>>>>>>     rename it in the official 4.0 release.
>>>>>>>   - pyspark-connect: the full package with Spark Connect enabled by
>>>>>>>     default.
>>>>>>> - An additional tarball will be added to the Spark 4.0 download page
>>>>>>>   with updated scripts (spark-submit, spark-shell, etc.) that enable
>>>>>>>   Spark Connect by default.
>>>>>>> - A new Docker image will be provided with Spark Connect enabled by
>>>>>>>   default.
>>>>>>>
>>>>>>> By taking this approach, we can make Spark Connect more visible and
>>>>>>> accessible to users, which is more effective than simply asking them
>>>>>>> to configure it manually.
>>>>>>>
>>>>>>> Looking forward to hearing your thoughts!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Wenchen
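(To make the package split in the proposal concrete: the thin pyspark-client package ships no JVM, so a session must point at a remote Connect endpoint explicitly. A minimal sketch, assuming `pip install pyspark-client` and a Spark Connect server already running on the default port 15002:)

```python
# A minimal sketch, assuming the thin pyspark-client package is installed
# and a Spark Connect server is already running on the default port 15002.
from pyspark.sql import SparkSession

# The thin client has no local JVM to fall back on, so the remote endpoint
# must be supplied explicitly (or via the SPARK_REMOTE environment variable).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(10).filter("id % 2 == 0")
print(df.count())  # the query runs on the server; results return over gRPC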
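(On the "configure it manually" point: today a user has to start a Connect server themselves, e.g. via sbin/start-connect-server.sh, and point their session at it, whereas the proposed tarball and Docker image would have the default scripts do this out of the box. Below is a quick way to check which mode a session ended up in; a sketch only, since the module path is an implementation detail rather than a stable API:)

```python
# A sketch for checking whether the default session is a Connect session,
# e.g. when trying out the proposed Connect-enabled distribution. The module
# path is an implementation detail, not a stable API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Classic sessions live under pyspark.sql.session; Connect sessions live
# under pyspark.sql.connect.session.
if type(spark).__module__.startswith("pyspark.sql.connect"):
    print("Spark Connect session")
else:
    print("Classic Spark session")
```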