Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Mridul Muralidharan Thu, 13 Mar 2025 23:32:31 -0700

  I agree with Mark, imo this is a qualified veto.
We should give Dongjoon the opportunity to give his clarification, if any.


I do realize this delays the RC process, but this deserves to be looked
into carefully.

Thanks,
Mridul


On Thu, Mar 13, 2025 at 9:35 PM Mark Hamstra <[email protected]> wrote:

> Absolutely not!
>
> This is clearly a vote on a code change, not on a procedural issue or
> a package release. The code change has been vetoed by a -1 vote by a
> qualified voter.
>
> On Thu, Mar 13, 2025 at 6:58 PM Jungtaek Lim
> <[email protected]> wrote:
> >
> > As I said, I'm concluding the VOTE since we meet the criteria (3 +1
> > binding, 1 -1 binding, and also +1s from non-binding).
> >
> > I don't consider -1 as a veto as I explained, as we should have multiple
> -1s if we go for VOTE with the current codebase. (+1 in this proposal is
> effectively -1 in another proposal.)
> >
> > The vote followed the Apache Voting Process with the type of "package
> > release" (which we tend to use in dev@ for VOTE). I guess it could also
> > have been done as "procedural issues", which is less strict, but then this
> > fulfills both types of votes, which should be OK.
> >
> > The current codebase is "accidentally" representing another proposal, and
> > that was never intended. I don't see a way I can -1 the current codebase
> > and, to be fair, make a different change that is bound to neither proposal.
> >
> > I don't want to block the release because of the above. So, let's change
> the current codebase the way we discussed and voted here. Reverting this
> decision should require another VOTE.
> >
> > Thanks to everyone who voted!
> >
> > On Thu, Mar 13, 2025 at 4:54 PM Jungtaek Lim <
> [email protected]> wrote:
> >>
> >> Thanks to everyone who participated and voted!
> >>
> >> Now I can technically conclude the VOTE, but I'm willing to wait till
> US daytime tomorrow, to give some time for Dongjoon to revisit this.
> >>
> >> I'll conclude the vote around 6PM PST tomorrow regardless of his vote.
> It's ideal to see us have no -1, but having one -1 doesn't block this vote
> and we can move forward.
> >>
> >> On Thu, Mar 13, 2025 at 4:42 PM Yang Jie <[email protected]> wrote:
> >>>
> >>> forgot to mention in my last reply, my stance is +1
> >>>
> >>> Jie Yang
> >>>
> >>> On 2025/03/13 07:08:12 Russell Jurney wrote:
> >>> > Sure, +1 non-binding.
> >>> >
> >>> > On Wed, Mar 12, 2025 at 11:18 PM Jungtaek Lim <
> [email protected]>
> >>> > wrote:
> >>> >
> >>> > > Russell,
> >>> > >
> >>> > > Of course, we hear the voices of people who don't have binding votes
> >>> > > as well. Personally I think it's more important than committers/PMC
> >>> > > members' VOTE this time, since we can be biased and be far from user
> >>> > > experience.
> >>> > >
> >>> > > Could you please explicitly cast your vote, like +1 (non-binding)?
> You
> >>> > > seem to agree with the proposal. Thanks!
> >>> > >
> >>> > > On Thu, Mar 13, 2025 at 3:15 PM Russell Jurney <
> [email protected]>
> >>> > > wrote:
> >>> > >
> >>> > >> I'm just a lurker and aspiring contributor, but as a Spark user
> upgrading
> >>> > >> twice is very confusing and would cause many or most users to
> fail to
> >>> > >> upgrade successfully to Spark 4 on a first go. That seems like a
> very bad
> >>> > >> user experience. I thought it was worthwhile stating this out
> loud.
> >>> > >>
> >>> > >> Russell
> >>> > >>
> >>> > >> On Wed, Mar 12, 2025 at 11:05 PM Xiao Li <[email protected]>
> wrote:
> >>> > >>
> >>> > >>>> this vote is to allow streaming queries which had been ever run in
> >>> > >>>> Spark 3.5.4 to be upgraded with Spark 4.0.x, "without having to be
> >>> > >>>> upgraded with Spark 3.5.5+ in prior".
> >>> > >>>
> >>> > >>>
> >>> > >>> In the history of Apache Spark, have we ever required users to
> upgrade
> >>> > >>> to the next maintenance release before moving to a new feature
> or major
> >>> > >>> release?
> >>> > >>>
> >>> > >>> Xiao
> >>> > >>>
> >>> > >>> On Tue, Mar 11, 2025 at 09:08, Adam Binford <[email protected]> wrote:
> >>> > >>>
> >>> > >>>> +1 (non-binding)
> >>> > >>>>
> >>> > >>>> It's a pretty in the weeds issue with how Structured Streaming
> works
> >>> > >>>> under the hood that's kinda hard to understand if you're not
> familiar with
> >>> > >>>> it. The migration logic doesn't mean users can still use the
> old config,
> >>> > >>>> it's purely behind the scenes to fix checkpoint metadata in
> streams created
> >>> > >>>> in 3.5.4. The 5 lines of code it takes to address a weird edge
> case for
> >>> > >>>> certain users that's already gone from master shouldn't be a
> huge deal.
> >>> > >>>>
> >>> > >>>> On Tue, Mar 11, 2025 at 1:43 AM Yang Jie <[email protected]>
> wrote:
> >>> > >>>>
> >>> > >>>>>
> >>> > >>>>> To Sean, you're right, I'm very sorry.
> >>> > >>>>>
> >>> > >>>>> From the perspective of compatibility and migratability, I
> think we
> >>> > >>>>> should migrate this logic to 4.0.0 and keep it in the codebase
> for a longer
> >>> > >>>>> time (or permanently), because we can't predict which version
> users of
> >>> > >>>>> 3.5.4 will choose next.
> >>> > >>>>>
> >>> > >>>>>
> >>> > >>>>> I don't want to discuss the so-called vendor issue.
> >>> > >>>>>
> >>> > >>>>> I withdraw my previous -1.
> >>> > >>>>>
> >>> > >>>>> Jie Yang.
> >>> > >>>>>
> >>> > >>>>> On 2025/03/11 04:42:25 Wenchen Fan wrote:
> >>> > >>>>> > Guys, let’s be honest about what we’re discussing here.
> >>> > >>>>> >
> >>> > >>>>> > If this is a migration issue, why would we even need a vote?
> We’ve
> >>> > >>>>> been
> >>> > >>>>> > consistently adding configurations to restore legacy behavior
> >>> > >>>>> instead of
> >>> > >>>>> > removing them because we understand the challenges of
> upgrading Spark
> >>> > >>>>> > versions. Our goal has always been to make upgrades easier,
> even if
> >>> > >>>>> it
> >>> > >>>>> > means carrying some technical debt. I don’t think we want to
> change
> >>> > >>>>> that
> >>> > >>>>> > culture now.
> >>> > >>>>> >
> >>> > >>>>> > If the concern is about vendor names appearing in the
> codebase, then
> >>> > >>>>> why is
> >>> > >>>>> > it a big deal this time when vendor names are already present
> >>> > >>>>> elsewhere? If
> >>> > >>>>> > we’ve failed to follow a policy, let’s correct it, but can
> someone
> >>> > >>>>> point to
> >>> > >>>>> > the specific policy we’re violating?
> >>> > >>>>> >
> >>> > >>>>> > If the vote is about adding migration logic to ease the
> upgrade from
> >>> > >>>>> 3.5.4
> >>> > >>>>> > to 4.0.0, then +1, why not?
> >>> > >>>>> >
> >>> > >>>>> > Thanks,
> >>> > >>>>> > Wenchen
> >>> > >>>>> >
> >>> > >>>>> >
> >>> > >>>>> >
> >>> > >>>>> > On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim <
> >>> > >>>>> [email protected]>
> >>> > >>>>> > wrote:
> >>> > >>>>> >
> >>> > >>>>> > > Well said, Sean. Sorry I made you stick around here, since it
> >>> > >>>>> > > might not have been clearly stated. My bad.
> >>> > >>>>> > >
> >>> > >>>>> > > Yang, how could we ever tolerate the fact there are "other"
> >>> > >>>>> occurrences of
> >>> > >>>>> > > vendor names in the codebase? Please go and search
> "databricks" in
> >>> > >>>>> the
> >>> > >>>>> > > codebase and be surprised.
> >>> > >>>>> > >
> >>> > >>>>> > > If we believe that having vendor names in the codebase will
> >>> > >>>>> increase
> >>> > >>>>> > > the occurrence of making mistakes, why didn't we have a
> discussion
> >>> > >>>>> thread
> >>> > >>>>> > > earlier to remove all occurrences altogether? This is
> super tricky
> >>> > >>>>> because
> >>> > >>>>> > > I can even start to argue we have "Apple" as a vendor name
> in
> >>> > >>>>> Apache Spark
> >>> > >>>>> > > codebase. I'm not saying we use "apple" in the test data.
> See
> >>> > >>>>> > > `isMacOnAppleSilicon` in Utils. Is it unavoidable? No,
> >>> > >>>>> `isMacOnMSeries` or
> >>> > >>>>> > > `isMacOnSilicon` is enough.
> >>> > >>>>> > >
> >>> > >>>>> > > We really need to draw a line where we disallow vendor
> names on it
> >>> > >>>>> - if
> >>> > >>>>> > > it's the entire codebase, I don't really think it is
> realistic.
> >>> > >>>>> > >
> >>> > >>>>> > > This was really a mistake, and it was definitely not from
> >>> > >>>>> referring to the
> >>> > >>>>> > > existing codebase. Not having a vendor name does not change
> >>> > >>>>> anything on the
> >>> > >>>>> > > chance of encountering this issue again. If we really
> care, we
> >>> > >>>>> should think
> >>> > >>>>> > > about style checking, which is the only viable way to
> catch the
> >>> > >>>>> mistake.
> >>> > >>>>> > > Again, I'd argue we have to have a bunch of vendor names
> in that
> >>> > >>>>> style
> >>> > >>>>> > > check, not just the problematic vendor name.
> >>> > >>>>> > >
> >>> > >>>>> > >
> >>> > >>>>> > > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <
> [email protected]>
> >>> > >>>>> wrote:
> >>> > >>>>> > >
> >>> > >>>>> > >> Doesn't the migration code 'clear' the debt?
> >>> > >>>>> > >> The proposal is not to continue to support the config.
> >>> > >>>>> > >> I feel like people are not quite understanding the
> change, and
> >>> > >>>>> objecting
> >>> > >>>>> > >> to something that doesn't exist.
> >>> > >>>>> > >> It's a shame, as this seems like something not even worth
> >>> > >>>>> discussing. I
> >>> > >>>>> > >> don't know why this triggered this much discussion. We
> have kept
> >>> > >>>>> deprecated
> >>> > >>>>> > >> methods without blinking, which is in comparison much
> bigger.
> >>> > >>>>> > >> Can we maybe ask you to review the actual change in question?
> >>> > >>>>> > >>
> >>> > >>>>> > >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <
> [email protected]>
> >>> > >>>>> wrote:
> >>> > >>>>> > >>
> >>> > >>>>> > >>> -1
> >>> > >>>>> > >>> Remove migration logic of incorrect `spark.databricks.*`
> >>> > >>>>> configuration
> >>> > >>>>> > >>> in Spark 4.0.0 because I think this configuration was
> initially
> >>> > >>>>> introduced
> >>> > >>>>> > >>> accidentally in Spark 3.5.4, lacking a clear design
> intent.
> >>> > >>>>> Although the
> >>> > >>>>> > >>> immediate maintenance cost of retaining this
> configuration
> >>> > >>>>> currently seems
> >>> > >>>>> > >>> limited, as subsequent versions iterate and user habits
> form, it
> >>> > >>>>> may lead
> >>> > >>>>> > >>> to the continuous accumulation of technical debt. When
> users
> >>> > >>>>> come to view
> >>> > >>>>> > >>> this configuration as one that can be relied on
> long-term,
> >>> > >>>>> future removal
> >>> > >>>>> > >>> may face greater resistance from users and could
> potentially
> >>> > >>>>> become an
> >>> > >>>>> > >>> entrenched and redundant configuration in the codebase.
> >>> > >>>>> Therefore, promptly
> >>> > >>>>> > >>> correcting this historically accidental configuration
> not only
> >>> > >>>>> maintains
> >>> > >>>>> > >>> the normativity of the Spark configuration system but
> also
> >>> > >>>>> prevents
> >>> > >>>>> > >>> unintended configurations from becoming de facto
> standards,
> >>> > >>>>> thereby
> >>> > >>>>> > >>> reducing long-term maintenance risks.
> >>> > >>>>> > >>>
> >>> > >>>>> > >>> Jie Yang
> >>> > >>>>> > >>>
> >>> > >>>>> > >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote:
> >>> > >>>>> > >>> > -1 because there exists a feasible migration path for
> Apache
> >>> > >>>>> Spark
> >>> > >>>>> > >>> 3.5.4 via Apache Spark 3.5.5.
> >>> > >>>>> > >>> >
> >>> > >>>>> > >>> > It's obvious that this mistake by Databricks has already
> >>> > >>>>> > >>> > caused a huge communication cost in the Apache Spark
> >>> > >>>>> > >>> > community and would impose a burden forcing us to handle at
> >>> > >>>>> > >>> > least two more PRs at 4.0.0 and 4.1.0.
> >>> > >>>>> > >>> >
> >>> > >>>>> > >>> > Given that, I don't think
> >>> > >>>>> > >>> > - This is inevitable, or
> >>> > >>>>> > >>> > - This is 0 cost
> >>> > >>>>> > >>> >
> >>> > >>>>> > >>> > Dongjoon.
> >>> > >>>>> > >>> >
> >>> > >>>>> > >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote:
> >>> > >>>>> > >>> > > Starting from my +1 (non-binding).
> >>> > >>>>> > >>> > >
> >>> > >>>>> > >>> > > In addition, I propose to retain migration logic
> till Spark
> >>> > >>>>> 4.1.x and
> >>> > >>>>> > >>> > > remove it in Spark 4.2.0.
> >>> > >>>>> > >>> > >
> >>> > >>>>> > >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim <
> >>> > >>>>> > >>> [email protected]>
> >>> > >>>>> > >>> > > wrote:
> >>> > >>>>> > >>> > >
> >>> > >>>>> > >>> > > > Hi dev,
> >>> > >>>>> > >>> > > >
> >>> > >>>>> > >>> > > > Please vote to retain migration logic of incorrect
> >>> > >>>>> > >>> `spark.databricks.*`
> >>> > >>>>> > >>> > > > configuration in Spark 4.0.x.
> >>> > >>>>> > >>> > > >
> >>> > >>>>> > >>> > > > - DISCUSSION:
> >>> > >>>>> > >>> > > > https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr
> >>> > >>>>> > >>> > > > ([DISCUSS] Handling spark.databricks.* config being
> >>> > >>>>> > >>> > > > exposed in 3.5.4 in Spark 4.0.0+)
> >>> > >>>>> > >>> > > >
> >>> > >>>>> > >>> > > > Specifically, please review this post
> >>> > >>>>> > >>> > > > https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k
> >>> > >>>>> > >>> > > > which explains pros and cons about the proposal - the
> >>> > >>>>> > >>> > > > proposal is about "Option 1".
> >>> > >>>>> > >>> > > >
> >>> > >>>>> > >>> > > > Simply speaking, this vote is to allow streaming
> queries
> >>> > >>>>> which had
> >>> > >>>>> > >>> been
> >>> > >>>>> > >>> > > > ever run in Spark 3.5.4 to be upgraded with Spark
> 4.0.x,
> >>> > >>>>> "without
> >>> > >>>>> > >>> having to
> >>> > >>>>> > >>> > > > be upgraded with Spark 3.5.5+ in prior". If the
> vote
> >>> > >>>>> passes, we
> >>> > >>>>> > >>> will help
> >>> > >>>>> > >>> > > > users to have a smooth upgrade from Spark 3.5.4 to
> Spark
> >>> > >>>>> 4.0.x,
> >>> > >>>>> > >>> which would
> >>> > >>>>> > >>> > > > be almost 1 year.
> >>> > >>>>> > >>> > > >
> >>> > >>>>> > >>> > > > The (only) con of this option is having to retain the
> >>> > >>>>> > >>> > > > incorrect configuration name as a "string" in the
> >>> > >>>>> > >>> > > > codebase a bit longer. The code complexity of the
> >>> > >>>>> > >>> > > > migration logic is arguably trivial. (link
> >>> > >>>>> > >>> > > > <https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183>)
> >>> > >>>>> > >>> > > >
> >>> > >>>>> > >>> > > > This VOTE is for Spark 4.0.x, but if someone
> supports
> >>> > >>>>> including
> >>> > >>>>> > >>> migration
> >>> > >>>>> > >>> > > > logic to be longer than Spark 4.0.x, please cast
> +1 here
> >>> > >>>>> and leave
> >>> > >>>>> > >>> the
> >>> > >>>>> > >>> > > > desired last minor version of Spark to retain this
> >>> > >>>>> migration logic.
> >>> > >>>>> > >>> > > >
> >>> > >>>>> > >>> > > > The vote is open for the next 72 hours and passes
> if a
> >>> > >>>>> majority +1
> >>> > >>>>> > >>> PMC
> >>> > >>>>> > >>> > > > votes are cast, with a minimum of 3 +1 votes.
> >>> > >>>>> > >>> > > >
> >>> > >>>>> > >>> > > > [ ] +1 Retain migration logic of incorrect
> >>> > >>>>> `spark.databricks.*`
> >>> > >>>>> > >>> > > > configuration in Spark 4.0.x
> >>> > >>>>> > >>> > > > [ ] -1 Remove migration logic of incorrect
> >>> > >>>>> `spark.databricks.*`
> >>> > >>>>> > >>> > > > configuration in Spark 4.0.0 because...
> >>> > >>>>> > >>> > > >
> >>> > >>>>> > >>> > > > Thanks!
> >>> > >>>>> > >>> > > > Jungtaek Lim (HeartSaVioR)
> >>> > >>>>> > >>> > > >
> >>> > >>>>> > >>> > >
> >>> > >>>>> > >>> >
> >>> > >>>>> > >>> >
> >>> > >>>>>
> ---------------------------------------------------------------------
> >>> > >>>>> > >>> > To unsubscribe e-mail:
> [email protected]
> >>> > >>>>> > >>> >
> >>> > >>>>> > >>> >
> >>> > >>>>> > >>>
> >>> > >>>>> > >>>
> >>> > >>>>>
> >>> > >>>>> > >>>
> >>> > >>>>> > >>>
> >>> > >>>>> >
> >>> > >>>>>
> >>> > >>>>>
> >>> > >>>>>
> >>> > >>>>>
> >>> > >>>>
> >>> > >>>> --
> >>> > >>>> Adam Binford
> >>> > >>>>
> >>> > >>>
> >>> >
> >>>
> >>>
>
>
>
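
For readers unfamiliar with the change under vote: the thread describes the
migration logic as something that runs purely behind the scenes to fix
checkpoint metadata in streams created in 3.5.4, without keeping the old
config usable. A minimal sketch of that idea follows. It is not the code in
OffsetSeq.scala; it is a Python illustration only, and the key names and the
function `migrate_checkpoint_conf` are hypothetical, chosen because the
thread does not spell out the actual configuration name.

```python
# Hypothetical key names for illustration; the thread only says the
# incorrect key lives under the `spark.databricks.*` prefix.
LEGACY_KEY = "spark.databricks.example.someFlag"
CORRECT_KEY = "spark.example.someFlag"


def migrate_checkpoint_conf(conf):
    """Return a copy of the checkpoint conf with the legacy key rewritten.

    Assumption: the checkpoint metadata stores configs as a flat
    string-to-string mapping. The input mapping is left unmodified.
    """
    migrated = dict(conf)
    if LEGACY_KEY in migrated:
        legacy_value = migrated.pop(LEGACY_KEY)
        # If the correct key is already present, keep its value and
        # simply drop the legacy entry.
        migrated.setdefault(CORRECT_KEY, legacy_value)
    return migrated


if __name__ == "__main__":
    old_checkpoint = {LEGACY_KEY: "false", "spark.sql.shuffle.partitions": "200"}
    print(migrate_checkpoint_conf(old_checkpoint))
```

In this sketch a checkpoint that never contained the legacy key passes
through unchanged, which mirrors the point made in the thread that only
streams created in 3.5.4 are affected.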
