+1 (non-binding) on this proposal. As long as there are no schedule concerns (echoing Mridul's and Dongjoon's call-outs), I think this would be helpful for adoption. Thanks!
On Tue, Feb 4, 2025 at 18:43 huaxin gao <huaxin.ga...@gmail.com> wrote:

> I support publishing an additional Spark distribution with Spark Connect
> enabled in Spark 4.0 to boost Spark adoption. I also share Dongjoon's
> concern regarding potential schedule delays. As long as we monitor the
> timeline closely and thoroughly document any PRs that do not make it into
> the RC, we should be in good shape.
>
> I am casting my +1 on this proposal.
>
> On Tue, Feb 4, 2025 at 6:10 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>
>> +1 to new distribution mechanisms that will increase Spark adoption!
>>
>> I do agree with Dongjoon's concern that this should not result in
>> slipping the schedule; that is something to watch out for.
>>
>> Regards,
>> Mridul
>>
>> On Tue, Feb 4, 2025 at 8:07 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>
>>> I am fine with providing another option (+1), leaving the others as they
>>> are. Once the vote passes, we should probably make it ready ASAP; I don't
>>> think it will need many changes in any event.
>>>
>>> On Wed, 5 Feb 2025 at 02:40, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>>> Many of the remaining PRs relate to Spark ML Connect support, but they
>>>> are not critical blockers for offering an additional Spark distribution
>>>> with Spark Connect enabled by default in Spark 4.0, which would let
>>>> users try it out and provide more feedback.
>>>>
>>>> I agree that we should not postpone the Spark 4.0 release. If these PRs
>>>> do not land before the RC cut, we should ensure they are properly
>>>> documented.
>>>>
>>>> Thanks,
>>>>
>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>
>>>> On Feb 4, 2025, at 7:23 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>> Many new-feature `Connect` patches are still landing in `branch-4.0`
>>>> during the QA period, after February 1st:
>>>>
>>>> SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect Scala Client
>>>> SPARK-50104 Support SparkSession.executeCommand in Connect
>>>> SPARK-50943 Support `Correlation` on Connect
>>>> SPARK-50133 Support DataFrame conversion to table argument in Spark Connect Python Client
>>>> SPARK-50942 Support `ChiSquareTest` on Connect
>>>> SPARK-50899 Support PrefixSpan on connect
>>>> SPARK-51060 Support `QuantileDiscretizer` on Connect
>>>> SPARK-50974 Add support foldCol for CrossValidator on connect
>>>> SPARK-51015 Support RFormulaModel.toString on Connect
>>>> SPARK-50843 Support return a new model from existing one
>>>>
>>>> AFAIK, the only thing the community can agree on is that `Connect`
>>>> development is not finished yet.
>>>> - Since `Connect` development is not finished, more patches will land
>>>> if we want it to be complete.
>>>> - Since `Connect` development is not finished, there are more concerns
>>>> about adding this as a new distribution.
>>>>
>>>> That's why I asked about the release schedule only. To deliver the
>>>> newly proposed distribution meaningfully and completely in Spark 4.0,
>>>> we need to consider not only your new patch but also the remaining
>>>> `Connect` PRs.
>>>>
>>>> So, let me ask you again: are you sure that there will be no delay?
>>>> Looking at the commit history, I'm wondering whether both Herman and
>>>> Ruifeng agree with you.
>>>>
>>>> To be clear, if there is no harm to the Apache Spark community, I'll
>>>> give my +1, of course. Why not?
>>>>
>>>> Thanks,
>>>> Dongjoon.
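(For context on the ML-on-Connect tickets listed above: once patches like SPARK-50943 land, existing pyspark.ml code is expected to run unchanged against a Connect session. A minimal sketch, assuming a Spark 4.0 build that includes these patches and a Connect server running on the default port:)

```python
# A minimal sketch, assuming a Spark 4.0 build with the ML-on-Connect
# patches above and a Spark Connect server on localhost:15002.
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

data = [(Vectors.dense([1.0, 0.0, 3.0]),),
        (Vectors.dense([2.0, 5.0, 1.0]),),
        (Vectors.dense([4.0, 2.0, 8.0]),)]
df = spark.createDataFrame(data, ["features"])

# The same API as classic PySpark; with SPARK-50943 the computation is
# expected to run over Connect instead of requiring a local JVM.
row = Correlation.corr(df, "features").head()
print(row[0])  # Pearson correlation matrix
```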
>>>> On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> Hi Dongjoon,
>>>>>
>>>>> This is a big decision but not a big project. We just need to update
>>>>> the release scripts to produce the additional Spark distribution. If
>>>>> people are positive about this, I can start implementing the script
>>>>> changes now and merge them after this proposal has been voted on and
>>>>> approved.
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Wenchen.
>>>>>>
>>>>>> I'm wondering whether this implies any delay to the existing QA and
>>>>>> RC1 schedule.
>>>>>>
>>>>>> If so, why don't we schedule this new alternative proposal for Spark
>>>>>> 4.1 properly?
>>>>>>
>>>>>> Best regards,
>>>>>> Dongjoon
>>>>>>
>>>>>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> There is partial agreement and consensus that Spark Connect is
>>>>>>> crucial for the future stability of Spark APIs, for both end users
>>>>>>> and developers. At the same time, a couple of PMC members raised
>>>>>>> concerns about making Spark Connect the default in the upcoming
>>>>>>> Spark 4.0 release. I'm proposing an alternative approach here:
>>>>>>> publish an additional Spark distribution with Spark Connect enabled
>>>>>>> by default. This approach will help promote the adoption of Spark
>>>>>>> Connect among new users while allowing us to gather valuable
>>>>>>> feedback. A separate distribution with Spark Connect enabled by
>>>>>>> default can also promote future adoption of Spark Connect for
>>>>>>> languages like Rust, Go, or Scala 3.
>>>>>>>
>>>>>>> Here are the details of the proposal:
>>>>>>>
>>>>>>> - Spark 4.0 will include three PyPI packages:
>>>>>>>   - pyspark: the classic package.
>>>>>>>   - pyspark-client: the thin Spark Connect Python client. Note that
>>>>>>>     in the Spark 4.0 preview releases we published the
>>>>>>>     pyspark-connect package for the thin client; we will need to
>>>>>>>     rename it in the official 4.0 release.
>>>>>>>   - pyspark-connect: the full package with Spark Connect enabled by
>>>>>>>     default.
>>>>>>> - An additional tarball will be added to the Spark 4.0 download page
>>>>>>>   with updated scripts (spark-submit, spark-shell, etc.) that enable
>>>>>>>   Spark Connect by default.
>>>>>>> - A new Docker image will be provided with Spark Connect enabled by
>>>>>>>   default.
>>>>>>>
>>>>>>> By taking this approach, we can make Spark Connect more visible and
>>>>>>> accessible to users, which is more effective than simply asking them
>>>>>>> to configure it manually.
>>>>>>>
>>>>>>> Looking forward to hearing your thoughts!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Wenchen
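(To make the package split in the proposal concrete: the thin pyspark-client package ships no JVM, so a session must point at a remote Connect endpoint explicitly. A minimal sketch, assuming `pip install pyspark-client` and a Spark Connect server already running on the default port 15002:)

```python
# A minimal sketch, assuming the thin pyspark-client package is installed
# and a Spark Connect server is already running on the default port 15002.
from pyspark.sql import SparkSession

# The thin client has no local JVM to fall back on, so the remote endpoint
# must be supplied explicitly (or via the SPARK_REMOTE environment variable).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(10).filter("id % 2 == 0")
print(df.count())  # the query runs on the server; results return over gRPC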
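(On the "configure it manually" point: today a user has to start a Connect server themselves, e.g. via sbin/start-connect-server.sh, and point their session at it, whereas the proposed tarball and Docker image would have the default scripts do this out of the box. Below is a quick way to check which mode a session ended up in; a sketch only, since the module path is an implementation detail rather than a stable API:)

```python
# A sketch for checking whether the default session is a Connect session,
# e.g. when trying out the proposed Connect-enabled distribution. The module
# path is an implementation detail, not a stable API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Classic sessions live under pyspark.sql.session; Connect sessions live
# under pyspark.sql.connect.session.
if type(spark).__module__.startswith("pyspark.sql.connect"):
    print("Spark Connect session")
else:
    print("Classic Spark session")
```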