Many new-feature `Connect` patches are still landing in `branch-4.0` during the QA period after February 1st.
- SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect Scala Client
- SPARK-50104 Support SparkSession.executeCommand in Connect
- SPARK-50943 Support `Correlation` on Connect
- SPARK-50133 Support DataFrame conversion to table argument in Spark Connect Python Client
- SPARK-50942 Support `ChiSquareTest` on Connect
- SPARK-50899 Support PrefixSpan on connect
- SPARK-51060 Support `QuantileDiscretizer` on Connect
- SPARK-50974 Add support foldCol for CrossValidator on connect
- SPARK-51015 Support RFormulaModel.toString on Connect
- SPARK-50843 Support return a new model from existing one

AFAIK, the only thing we can agree on in the community is that `Connect` development is still unfinished.

- Since `Connect` development is still unfinished, more patches will land if we want it to be complete.
- Since `Connect` development is still unfinished, there are more concerns about adding this as a new distribution.

That's the reason why I asked about the release schedule only. We need to consider not only your new patch, but also the remaining `Connect` PRs, in order to deliver the newly proposed distribution meaningfully and completely in Spark 4.0.

So, let me ask you again. Are you sure that there will be no delay? Judging from the commit history, I'm wondering whether both Herman and Ruifeng agree with you.

To be clear, if there is no harm to the Apache Spark community, I'll give a +1, of course. Why not?

Thanks,
Dongjoon.

On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi Dongjoon,
>
> This is a big decision but not a big project. We just need to update the
> release scripts to produce the additional Spark distribution. If people
> are positive about this, I can start to implement the script changes now
> and merge them after this proposal has been voted on and approved.
>
> Thanks,
> Wenchen
>
> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, Wenchen.
>>
>> I'm wondering whether this implies any delay of the existing QA and RC1
>> schedule.
>>
>> If so, why don't we schedule this new alternative proposal for Spark 4.1
>> properly?
>>
>> Best regards,
>> Dongjoon
>>
>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> There is partial consensus that Spark Connect is crucial for the
>>> future stability of Spark APIs, for both end users and developers. At
>>> the same time, a couple of PMC members raised concerns about making
>>> Spark Connect the default in the upcoming Spark 4.0 release. I'm
>>> proposing an alternative approach here: publish an additional Spark
>>> distribution with Spark Connect enabled by default. This approach will
>>> help promote the adoption of Spark Connect among new users while
>>> allowing us to gather valuable feedback. A separate distribution with
>>> Spark Connect enabled by default can also promote future adoption of
>>> Spark Connect for languages like Rust, Go, or Scala 3.
>>>
>>> Here are the details of the proposal:
>>>
>>> - Spark 4.0 will include three PyPI packages:
>>>   - pyspark: The classic package.
>>>   - pyspark-client: The thin Spark Connect Python client. Note: in the
>>>     Spark 4.0 preview releases, we published the pyspark-connect
>>>     package for the thin client; we will need to rename it in the
>>>     official 4.0 release.
>>>   - pyspark-connect: Spark Connect enabled by default.
>>> - An additional tarball will be added to the Spark 4.0 download page
>>>   with updated scripts (spark-submit, spark-shell, etc.) to enable
>>>   Spark Connect by default.
>>> - A new Docker image will be provided with Spark Connect enabled by
>>>   default.
>>>
>>> By taking this approach, we can make Spark Connect more visible and
>>> accessible to users, which is more effective than simply asking them to
>>> configure it manually.
>>>
>>> Looking forward to hearing your thoughts!
>>>
>>> Thanks,
>>> Wenchen
>>
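[Editor's note: a minimal sketch of how a user might choose among the three proposed PyPI packages, inferred only from the descriptions in the thread above. The exact behavior of `pyspark-client` and `pyspark-connect` at install time, and whether the thin client ships the `pyspark` launcher, are assumptions; only the `--remote` option is a documented PySpark flag.]

```shell
# Classic package: full PySpark, Spark Connect not enabled by default.
pip install pyspark

# Thin Spark Connect Python client (assumption: no bundled JVM-side Spark;
# you point it at a running Spark Connect server, address is illustrative).
pip install pyspark-client
pyspark --remote "sc://localhost:15002"

# Proposed package with Spark Connect enabled by default (per the thread,
# this name was used for the thin client in 4.0 previews and would be
# repurposed in the official release).
pip install pyspark-connect
```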