Many new `Connect` feature patches are still landing in `branch-4.0`
during the QA period after February 1st.

SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect Scala Client
SPARK-50104 Support SparkSession.executeCommand in Connect
SPARK-50943 Support `Correlation` on Connect
SPARK-50133 Support DataFrame conversion to table argument in Spark Connect Python Client
SPARK-50942 Support `ChiSquareTest` on Connect
SPARK-50899 Support PrefixSpan on Connect
SPARK-51060 Support `QuantileDiscretizer` on Connect
SPARK-50974 Add support for foldCol for CrossValidator on Connect
SPARK-51015 Support RFormulaModel.toString on Connect
SPARK-50843 Support returning a new model from an existing one

AFAIK, the only thing the community can agree on is that `Connect`
development is not yet finished.
- Since `Connect` development is not yet finished, more patches will need
to land if we want it to be complete.
- Since `Connect` development is not yet finished, there are more concerns
about adding this as a new distribution.

That's why I asked only about the release schedule.
We need to consider not only your new patch but also the remaining
`Connect` PRs in order to deliver the newly proposed distribution
meaningfully and completely in Spark 4.0.

So, let me ask you again: are you sure that there will be no delay?
Based on the commit history, I'm wondering whether
both Herman and Ruifeng agree with you.

To be clear, if there is no harm to the Apache Spark community,
I'll give a +1, of course. Why not?

Thanks,
Dongjoon.




On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi Dongjoon,
>
> This is a big decision but not a big project. We just need to update the
> release scripts to produce the additional Spark distribution. If people are
> positive about this, I can start implementing the script changes now and
> merge them after this proposal has been voted on and approved.
>
> Thanks,
> Wenchen
>
> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, Wenchen.
>>
>> I'm wondering whether this implies any delay to the existing QA and RC1
>> schedule.
>>
>> If so, why don't we properly schedule this new alternative proposal for
>> Spark 4.1?
>>
>> Best regards,
>> Dongjoon
>>
>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> There is partial agreement and consensus that Spark Connect is crucial
>>> for the future stability of Spark APIs for both end users and developers.
>>> At the same time, a couple of PMC members raised concerns about making
>>> Spark Connect the default in the upcoming Spark 4.0 release. I’m proposing
>>> an alternative approach here: publish an additional Spark distribution with
>>> Spark Connect enabled by default. This approach will help promote the
>>> adoption of Spark Connect among new users while allowing us to gather
>>> valuable feedback. A separate distribution with Spark Connect enabled by
>>> default can promote future adoption of Spark Connect for languages like
>>> Rust, Go, or Scala 3.
>>>
>>> Here are the details of the proposal:
>>>
>>>    - Spark 4.0 will include three PyPI packages:
>>>       - pyspark: The classic package.
>>>       - pyspark-client: The thin Spark Connect Python client. Note: in
>>>       the Spark 4.0 preview releases, we published the pyspark-connect
>>>       package for the thin client; we will need to rename it in the
>>>       official 4.0 release.
>>>       - pyspark-connect: Spark Connect enabled by default.
>>>    - An additional tarball will be added to the Spark 4.0 download page
>>>    with updated scripts (spark-submit, spark-shell, etc.) to enable Spark
>>>    Connect by default.
>>>    - A new Docker image will be provided with Spark Connect enabled by
>>>    default.
>>>
>>> By taking this approach, we can make Spark Connect more visible and
>>> accessible to users, which is more effective than simply asking them to
>>> configure it manually.
>>>
>>> Looking forward to hearing your thoughts!
>>>
>>> Thanks,
>>> Wenchen
>>>
>>
