I support publishing an additional Spark distribution with Spark Connect enabled in Spark 4.0 to boost Spark adoption. I also share Dongjoon's concern regarding potential schedule delays. As long as we monitor the timeline closely and thoroughly document any PRs that do not make it into the RC, we should be in good shape.
I am casting my +1 on this proposal.

On Tue, Feb 4, 2025 at 6:10 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>
> +1 to new distribution mechanisms which will increase Spark adoption!
>
> I do agree with Dongjoon’s concerns that this should not result in
> slipping the schedule; something to watch out for.
>
> Regards,
> Mridul
>
> On Tue, Feb 4, 2025 at 8:07 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>
>> +1. I am fine with providing another option while leaving the others as
>> they are. Once the vote passes, we should probably make it ready ASAP -
>> I don't think it will need a lot of changes in any event.
>>
>> On Wed, 5 Feb 2025 at 02:40, DB Tsai <dbt...@dbtsai.com> wrote:
>>
>>> Many of the remaining PRs relate to Spark ML Connect support, but they
>>> are not critical blockers for offering an additional Spark distribution
>>> with Spark Connect enabled by default in Spark 4.0, allowing users to
>>> try it out and provide more feedback.
>>>
>>> I agree that we should not postpone the Spark 4.0 release. If these PRs
>>> do not land before the RC cut, we should ensure they are properly
>>> documented.
>>>
>>> Thanks,
>>>
>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>
>>> On Feb 4, 2025, at 7:23 AM, Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>> Many new `Connect` feature patches are still landing on `branch-4.0`
>>> during the QA period after February 1st.
>>>
>>> SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect Scala Client
>>> SPARK-50104 Support SparkSession.executeCommand in Connect
>>> SPARK-50943 Support `Correlation` on Connect
>>> SPARK-50133 Support DataFrame conversion to table argument in Spark Connect Python Client
>>> SPARK-50942 Support `ChiSquareTest` on Connect
>>> SPARK-50899 Support PrefixSpan on connect
>>> SPARK-51060 Support `QuantileDiscretizer` on Connect
>>> SPARK-50974 Add support foldCol for CrossValidator on connect
>>> SPARK-51015 Support RFormulaModel.toString on Connect
>>> SPARK-50843 Support return a new model from existing one
>>>
>>> AFAIK, the only thing the community can agree on is that `Connect`
>>> development is still unfinished.
>>> - Since `Connect` development is still unfinished, more patches will
>>> land if we want it to be complete.
>>> - Since `Connect` development is still unfinished, there are more
>>> concerns about adding this as a new distribution.
>>>
>>> That's the reason why I asked about the release schedule only.
>>> We need to consider not only your new patch, but also the remaining
>>> `Connect` PRs, in order to deliver the newly proposed distribution
>>> meaningfully and completely in Spark 4.0.
>>>
>>> So, let me ask you again: are you sure that there will be no delay?
>>> According to the commit history, I'm wondering whether
>>> Herman and Ruifeng agree with you or not.
>>>
>>> To be clear, if there is no harm to the Apache Spark community,
>>> I'll give +1 of course. Why not?
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>> On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Hi Dongjoon,
>>>>
>>>> This is a big decision but not a big project. We just need to update
>>>> the release scripts to produce the additional Spark distribution. If
>>>> people are positive about this, I can start implementing the script
>>>> changes now and merge them after this proposal has been voted on and
>>>> approved.
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, Wenchen.
>>>>>
>>>>> I'm wondering whether this implies any delay to the existing QA and
>>>>> RC1 schedule.
>>>>>
>>>>> If so, why don't we schedule this new alternative proposal for Spark
>>>>> 4.1 instead?
>>>>>
>>>>> Best regards,
>>>>> Dongjoon
>>>>>
>>>>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> There is partial agreement in the community that Spark Connect is
>>>>>> crucial for the future stability of Spark APIs for both end users and
>>>>>> developers. At the same time, a couple of PMC members raised concerns
>>>>>> about making Spark Connect the default in the upcoming Spark 4.0
>>>>>> release. I’m proposing an alternative approach here: publish an
>>>>>> additional Spark distribution with Spark Connect enabled by default.
>>>>>> This approach will help promote the adoption of Spark Connect among
>>>>>> new users while allowing us to gather valuable feedback. A separate
>>>>>> distribution with Spark Connect enabled by default can also promote
>>>>>> future adoption of Spark Connect for languages like Rust, Go, or
>>>>>> Scala 3.
>>>>>>
>>>>>> Here are the details of the proposal:
>>>>>>
>>>>>> - Spark 4.0 will include three PyPI packages:
>>>>>>   - pyspark: the classic package.
>>>>>>   - pyspark-client: the thin Spark Connect Python client. Note that
>>>>>>     in the Spark 4.0 preview releases we published the
>>>>>>     pyspark-connect package for the thin client; we will need to
>>>>>>     rename it in the official 4.0 release.
>>>>>>   - pyspark-connect: Spark Connect enabled by default.
>>>>>> - An additional tarball will be added to the Spark 4.0 download page
>>>>>>   with updated scripts (spark-submit, spark-shell, etc.) to enable
>>>>>>   Spark Connect by default.
>>>>>> - A new Docker image will be provided with Spark Connect enabled by
>>>>>>   default.
>>>>>>
>>>>>> By taking this approach, we can make Spark Connect more visible and
>>>>>> accessible to users, which is more effective than simply asking them
>>>>>> to configure it manually.
>>>>>>
>>>>>> Looking forward to hearing your thoughts!
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
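To make the difference between the proposed packages concrete, here is a rough sketch of how a user might try each one. The package names come from the proposal above; the `--remote` flag and `sc://` connection URL are the existing Spark Connect conventions, but the exact behavior of the renamed packages and the new tarball's launcher scripts is an assumption until the proposal ships.

```shell
# Classic package: unchanged behavior, Spark Classic by default.
pip install pyspark

# Thin client (proposed name): Python-only Spark Connect client with no
# bundled JVM; it needs a running Spark Connect server to talk to.
pip install pyspark-client

# Proposed new package: full Spark, with Spark Connect enabled by default.
pip install pyspark-connect

# With the proposed tarball, scripts like spark-shell and spark-submit
# would turn Connect on by default. Today the equivalent is opting in
# explicitly, e.g. pointing the shell at a Connect server:
./bin/pyspark --remote "sc://localhost:15002"
```

The practical upside of the separate distribution is that new users get this behavior without learning any of the opt-in flags first.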