Long-time Spark on YARN user here with some possibly dumb questions, but
I'm guessing other users might be wondering the same things.

First, what does "Spark Connect enabled by default" actually mean? I
assume this refers to the "spark.api.mode" discussion from before, but
even in that discussion there was no real description of what "connect"
API mode means at a technical level. I assume this is all referring to
the work being done in
https://github.com/apache/spark/pull/49107? Or does this simply mean
including the Connect dependency in the distribution so you don't need to
add it externally (I'm pretty sure that already happened when it was
moved into sql/)?

How does having this additional distribution promote the future adoption of
other language clients? Are there future plans to be able to "spark-submit"
a Rust/Go/C# application using Spark Connect? Or is it simply that
discouraging use of the RDD API will help adoption of other client
languages in the future?

From my perspective, the biggest barrier to entry for Spark Connect is
the lack of built-in security features. And if the "connect" API mode is
actually launching a local Connect server behind the scenes, how is that
server being secured so that only the local Spark session can connect to
it?
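
To make the question concrete, this is the kind of thing I'd want to see
- a sketch only. The Spark Connect connection string already supports a
token parameter, so a locally launched server could mint a random token
and require it. Whether "connect" mode does anything like this is
exactly what I'm asking; the endpoint and token below are made up:

    # Hypothetical client side of a token-secured local Connect server.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .remote("sc://localhost:15002/;token=SOME-RANDOM-TOKEN")
        .getOrCreate()
    )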

The last question is just: how much confusion is this going to cause
users? With no context, I would assume I need the "connect" distribution
so I can launch a Connect server and then use the "client" distribution
to connect to it, but in this case I believe all the "connect"
distribution is doing is enabling spark.api.mode=connect by default? That
isn't super clear in the description either. The rush to make something
the default so quickly (it's still not even merged in yet?) seems a
little suspicious to me, especially in today's environment of open source
supply chain attacks. I understand we're at a major version boundary, but
this seems like something that could always be made the default later,
after it has actually been tested in the real world.
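
To illustrate the naming confusion concretely, here's how I currently
read the three proposed PyPI packages - my assumption from this thread,
not from testing, so corrections welcome:

    # pip install pyspark          -> classic: in-process driver, RDD API works
    # pip install pyspark-connect  -> full distribution, connect mode by default
    # pip install pyspark-client   -> thin client only: no JVM, needs a server
    from pyspark.sql import SparkSession

    # With the thin client there is no local backend at all, so a remote
    # endpoint is mandatory (host is a placeholder):
    spark = SparkSession.builder.remote("sc://some-host:15002").getOrCreate()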

On Tue, Feb 4, 2025 at 11:15 PM L. C. Hsieh <vii...@gmail.com> wrote:

> +1 for the additional option.
>
> Agreed that we should keep on track with the schedule. If, as mentioned
> earlier, there are no critical blockers, it should be fine.
>
> On Tue, Feb 4, 2025 at 8:05 PM Denny Lee <denny.g....@gmail.com> wrote:
> >
> > +1 (non-binding) on this proposal. As long as there are no schedule
> concerns - similar to Mridul's and Dongjoon's call-outs - then yes, I
> think this would be helpful for adoption. Thanks!
> >
> >
> > On Tue, Feb 4, 2025 at 18:43 huaxin gao <huaxin.ga...@gmail.com> wrote:
> >>
> >> I support publishing an additional Spark distribution with Spark
> Connect enabled in Spark 4.0 to boost Spark adoption. I also share
> Dongjoon's concern regarding potential schedule delays. As long as we
> monitor the timeline closely and thoroughly document any PRs that do not
> make it into the RC, we should be in good shape.
> >>
> >> I am casting my +1 on this proposal.
> >>
> >>
> >>
> >> On Tue, Feb 4, 2025 at 6:10 PM Mridul Muralidharan <mri...@gmail.com>
> wrote:
> >>>
> >>>
> >>> +1 to new distribution mechanisms that will increase Spark adoption!
> >>>
> >>> I do agree with Dongjoon’s concerns that this should not result in
> slipping the schedule; something to watch out for.
> >>>
> >>> Regards,
> >>> Mridul
> >>>
> >>>
> >>>
> >>> On Tue, Feb 4, 2025 at 8:07 PM Hyukjin Kwon <gurwls...@apache.org>
> wrote:
> >>>>
> >>>> I am fine with providing another option - +1 - while leaving the
> others as they are. Once the vote passes, we should probably make it
> ready ASAP; I don't think it will need a lot of changes in any event.
> >>>>
> >>>> On Wed, 5 Feb 2025 at 02:40, DB Tsai <dbt...@dbtsai.com> wrote:
> >>>>>
> >>>>> Many of the remaining PRs relate to Spark ML Connect support, but
> they are not critical blockers for offering an additional Spark
> distribution with Spark Connect enabled by default in Spark 4.0, allowing
> users to try it out and provide more feedback.
> >>>>>
> >>>>> I agree that we should not postpone the Spark 4.0 release. If these
> PRs do not land before the RC cut, we should ensure they are properly
> documented.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> >>>>>
> >>>>> On Feb 4, 2025, at 7:23 AM, Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
> >>>>>
> >>>>> Many new `Connect` feature patches are still landing in
> >>>>> `branch-4.0` during the QA period after February 1st.
> >>>>>
> >>>>> SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect
> Scala Client
> >>>>> SPARK-50104 Support SparkSession.executeCommand in Connect
> >>>>> SPARK-50943 Support `Correlation` on Connect
> >>>>> SPARK-50133 Support DataFrame conversion to table argument in Spark
> Connect Python Client
> >>>>> SPARK-50942 Support `ChiSquareTest` on Connect
> >>>>> SPARK-50899 Support PrefixSpan on connect
> >>>>> SPARK-51060 Support `QuantileDiscretizer` on Connect
> >>>>> SPARK-50974 Add support foldCol for CrossValidator on connect
> >>>>> SPARK-51015 Support RFormulaModel.toString on Connect
> >>>>> SPARK-50843 Support return a new model from existing one
> >>>>>
> >>>>> AFAIK, the only thing the community can agree on is that `Connect`
> >>>>> development is still unfinished.
> >>>>> - Since `Connect` development is still unfinished, more patches will
> land if we want it to be complete.
> >>>>> - Since `Connect` development is still unfinished, there are more
> concerns about adding this as a new distribution.
> >>>>>
> >>>>> That's why I asked about the release schedule only.
> >>>>> We need to consider not only your new patch but also the remaining
> `Connect` PRs
> >>>>> in order to deliver the newly proposed distribution meaningfully and
> completely in Spark 4.0.
> >>>>>
> >>>>> So, let me ask you again: are you sure that there will be no delay?
> >>>>> Given the commit history, I'm wondering whether
> >>>>> both Herman and Ruifeng agree with you.
> >>>>>
> >>>>> To be clear, if there is no harm to the Apache Spark community,
> >>>>> I'll give +1 of course. Why not?
> >>>>>
> >>>>> Thanks,
> >>>>> Dongjoon.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> Hi Dongjoon,
> >>>>>>
> >>>>>> This is a big decision but not a big project. We just need to
> update the release scripts to produce the additional Spark distribution. If
> people are positive about this, I can start implementing the script changes
> now and merge them after this proposal has been voted on and approved.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Wenchen
> >>>>>>
> >>>>>> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi, Wenchen.
> >>>>>>>
> >>>>>>> I'm wondering whether this implies any delay to the existing QA
> and RC1 schedule.
> >>>>>>>
> >>>>>>> If so, why don't we properly schedule this new alternative
> proposal for Spark 4.1?
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Dongjoon
> >>>>>>>
> >>>>>>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> There is partial agreement that Spark Connect is crucial for
> the future stability of Spark APIs, for both end users and developers.
> At the same time, a couple of PMC members raised concerns about making
> Spark Connect the default in the upcoming Spark 4.0 release. I’m
> proposing an alternative approach here: publish an additional Spark
> distribution with Spark Connect enabled by default. This approach will
> help promote the adoption of Spark Connect among new users while
> allowing us to gather valuable feedback. A separate distribution with
> Spark Connect enabled by default can also promote future adoption of
> Spark Connect clients in languages like Rust, Go, or Scala 3.
> >>>>>>>>
> >>>>>>>> Here are the details of the proposal:
> >>>>>>>>
> >>>>>>>> Spark 4.0 will include three PyPI packages:
> >>>>>>>>
> >>>>>>>> pyspark: The classic package.
> >>>>>>>> pyspark-client: The thin Spark Connect Python client. Note: in
> the Spark 4.0 preview releases we published the pyspark-connect package
> for the thin client; we will need to rename it in the official 4.0
> release.
> >>>>>>>> pyspark-connect: Spark Connect enabled by default.
> >>>>>>>>
> >>>>>>>> An additional tarball will be added to the Spark 4.0 download
> page with updated scripts (spark-submit, spark-shell, etc.) to enable Spark
> Connect by default.
> >>>>>>>> A new Docker image will be provided with Spark Connect enabled by
> default.
> >>>>>>>>
> >>>>>>>> By taking this approach, we can make Spark Connect more visible
> and accessible to users, which is more effective than simply asking them to
> configure it manually.
> >>>>>>>>
> >>>>>>>> Looking forward to hearing your thoughts!
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Wenchen
> >>>>>
> >>>>>

-- 
Adam Binford
