Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Mridul Muralidharan Thu, 13 Mar 2025 23:43:02 -0700

FWIW, I am +1 on the proposal (though I missed the vote on this!)

Regards,
Mridul


On Fri, Mar 14, 2025 at 1:31 AM Mridul Muralidharan <[email protected]>
wrote:

>
>   I agree with Mark, imo this is a qualified veto.
> We should give Dongjoon the opportunity to give his clarification, if any.
>
> I do realize this delays the RC process, but this deserves to be looked
> into carefully.
>
> Thanks,
> Mridul
>
>
> On Thu, Mar 13, 2025 at 9:35 PM Mark Hamstra <[email protected]>
> wrote:
>
>> Absolutely not!
>>
>> This is clearly a vote on a code change, not on a procedural issue or
>> a package release. The code change has been vetoed by a -1 vote by a
>> qualified voter.
>>
>> On Thu, Mar 13, 2025 at 6:58 PM Jungtaek Lim
>> <[email protected]> wrote:
>> >
>> > As I said, I'm concluding the VOTE since we meet the criteria
>> (3 +1 binding, 1 -1 binding, and also +1s from non-binding voters).
>> >
>> > I don't consider -1 as a veto as I explained, as we should have
>> multiple -1s if we go for VOTE with the current codebase. (+1 in this
>> proposal is effectively -1 in another proposal.)
>> >
>> > The vote followed the Apache Voting Process with the type of "package
>> release" (which we tend to use in dev@ for VOTE). I guess it could also
>> have been run as "procedural issues", which is less strict, but then this
>> fulfills both types of votes which should be OK.
>> >
>> > The current codebase is "accidentally" representing another proposal,
>> and that was never intended. I don't see a way I can -1 the current
>> codebase, and make a different change neither bound to any proposal to be
>> fair.
>> >
>> > I don't want to block the release because of the above. So, let's
>> change the current codebase the way we discussed and voted here. Reverting
>> this decision should require another VOTE.
>> >
>> > Thanks to everyone who voted!
>> >
>> > On Thu, Mar 13, 2025 at 4:54 PM Jungtaek Lim <
>> [email protected]> wrote:
>> >>
>> >> Thanks to everyone who participated and voted!
>> >>
>> >> Now I can technically conclude the VOTE, but I'm willing to wait till
>> US daytime tomorrow, to give some time for Dongjoon to revisit this.
>> >>
>> >> I'll conclude the vote around 6PM PST tomorrow regardless of his vote.
>> It's ideal to see us have no -1, but having one -1 doesn't block this vote
>> and we can move forward.
>> >>
>> >> On Thu, Mar 13, 2025 at 4:42 PM Yang Jie <[email protected]> wrote:
>> >>>
>> >>> forgot to mention in my last reply, my stance is +1
>> >>>
>> >>> Jie Yang
>> >>>
>> >>> On 2025/03/13 07:08:12 Russell Jurney wrote:
>> >>> > Sure, +1 non-binding.
>> >>> >
>> >>> > On Wed, Mar 12, 2025 at 11:18 PM Jungtaek Lim <
>> [email protected]>
>> >>> > wrote:
>> >>> >
>> >>> > > Russell,
>> >>> > >
>> >>> > > Of course, we hear the voices of people who don't have binding votes
>> as well.
>> >>> > > Personally I think it's more important than committers/PMC
>> members'  VOTE
>> >>> > > this time since we can be biased and be far from user experience.
>> >>> > >
>> >>> > > Could you please explicitly cast your vote, like +1
>> (non-binding)? You
>> >>> > > seem to agree with the proposal. Thanks!
>> >>> > >
>> >>> > > On Thu, Mar 13, 2025 at 3:15 PM Russell Jurney <
>> [email protected]>
>> >>> > > wrote:
>> >>> > >
>> >>> > >> I'm just a lurker and aspiring contributor, but as a Spark user
>> upgrading
>> >>> > >> twice is very confusing and would cause many or most users to
>> fail to
>> >>> > >> upgrade successfully to Spark 4 on a first go. That seems like a
>> very bad
>> >>> > >> user experience. I thought it was worthwhile stating this out
>> loud.
>> >>> > >>
>> >>> > >> Russell
>> >>> > >>
>> >>> > >> On Wed, Mar 12, 2025 at 11:05 PM Xiao Li <[email protected]>
>> wrote:
>> >>> > >>
>> >>> > >>> this vote is to allow streaming queries which had been ever run
>> in Spark
>> >>> > >>>> 3.5.4 to be upgraded with Spark 4.0.x, "without having to be
>> upgraded with
>> >>> > >>>> Spark 3.5.5+ in prior".
>> >>> > >>>
>> >>> > >>>
>> >>> > >>> In the history of Apache Spark, have we ever required users to
>> upgrade
>> >>> > >>> to the next maintenance release before moving to a new feature
>> or major
>> >>> > >>> release?
>> >>> > >>>
>> >>> > >>> Xiao
>> >>> > >>>
>> >>> > >>> On Tue, Mar 11, 2025 at 09:08, Adam Binford <[email protected]> wrote:
>> >>> > >>>
>> >>> > >>>> +1 (non-binding)
>> >>> > >>>>
>> >>> > >>>> It's a pretty in-the-weeds issue with how Structured Streaming
>> works
>> >>> > >>>> under the hood that's kinda hard to understand if you're not
>> familiar with
>> >>> > >>>> it. The migration logic doesn't mean users can still use the
>> old config,
>> >>> > >>>> it's purely behind the scenes to fix checkpoint metadata in
>> streams created
>> >>> > >>>> in 3.5.4. The 5 lines of code it takes to address a weird edge
>> case for
>> >>> > >>>> certain users that's already gone from master shouldn't be a
>> huge deal.
>> >>> > >>>>
>> >>> > >>>> On Tue, Mar 11, 2025 at 1:43 AM Yang Jie <[email protected]>
>> wrote:
>> >>> > >>>>
>> >>> > >>>>>
>> >>> > >>>>> To Sean, you're right, I'm very sorry.
>> >>> > >>>>>
>> >>> > >>>>> From the perspective of compatibility and migratability, I
>> think we
>> >>> > >>>>> should migrate this logic to 4.0.0 and keep it in the
>> codebase for a longer
>> >>> > >>>>> time (or permanently), because we can't predict which version
>> users of
>> >>> > >>>>> 3.5.4 will choose next.
>> >>> > >>>>>
>> >>> > >>>>>
>> >>> > >>>>> I don't want to discuss the so-called vendor issue.
>> >>> > >>>>>
>> >>> > >>>>> I withdraw my previous -1.
>> >>> > >>>>>
>> >>> > >>>>> Jie Yang.
>> >>> > >>>>>
>> >>> > >>>>> On 2025/03/11 04:42:25 Wenchen Fan wrote:
>> >>> > >>>>> > Guys, let’s be honest about what we’re discussing here.
>> >>> > >>>>> >
>> >>> > >>>>> > If this is a migration issue, why would we even need a
>> vote? We’ve
>> >>> > >>>>> been
>> >>> > >>>>> > consistently adding configurations to restore legacy
>> behavior
>> >>> > >>>>> instead of
>> >>> > >>>>> > removing them because we understand the challenges of
>> upgrading Spark
>> >>> > >>>>> > versions. Our goal has always been to make upgrades easier,
>> even if
>> >>> > >>>>> it
>> >>> > >>>>> > means carrying some technical debt. I don’t think we want
>> to change
>> >>> > >>>>> that
>> >>> > >>>>> > culture now.
>> >>> > >>>>> >
>> >>> > >>>>> > If the concern is about vendor names appearing in the
>> codebase, then
>> >>> > >>>>> why is
>> >>> > >>>>> > it a big deal this time when vendor names are already
>> present
>> >>> > >>>>> elsewhere? If
>> >>> > >>>>> > we’ve failed to follow a policy, let’s correct it, but can
>> someone
>> >>> > >>>>> point to
>> >>> > >>>>> > the specific policy we’re violating?
>> >>> > >>>>> >
>> >>> > >>>>> > If the vote is about adding migration logic to ease the
>> upgrade from
>> >>> > >>>>> 3.5.4
>> >>> > >>>>> > to 4.0.0, then +1, why not?
>> >>> > >>>>> >
>> >>> > >>>>> > Thanks,
>> >>> > >>>>> > Wenchen
>> >>> > >>>>> >
>> >>> > >>>>> >
>> >>> > >>>>> >
>> >>> > >>>>> > On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim <
>> >>> > >>>>> [email protected]>
>> >>> > >>>>> > wrote:
>> >>> > >>>>> >
>> >>> > >>>>> > > Well said, Sean. Sorry I made you keep around here since
>> it might
>> >>> > >>>>> not be
>> >>> > >>>>> > > clearly stated. My bad.
>> >>> > >>>>> > >
>> >>> > >>>>> > > Yang, how could we ever tolerate the fact there are
>> "other"
>> >>> > >>>>> occurrences of
>> >>> > >>>>> > > vendor names in the codebase? Please go and search
>> "databricks" in
>> >>> > >>>>> the
>> >>> > >>>>> > > codebase and be surprised.
>> >>> > >>>>> > >
>> >>> > >>>>> > > If we believe that having vendor names in the codebase
>> will
>> >>> > >>>>> increase
>> >>> > >>>>> > > the occurrence of making mistakes, why didn't we have a
>> discussion
>> >>> > >>>>> thread
>> >>> > >>>>> > > earlier to remove all occurrences altogether? This is
>> super tricky
>> >>> > >>>>> because
>> >>> > >>>>> > > I can even start to argue we have "Apple" as a vendor
>> name in
>> >>> > >>>>> Apache Spark
>> >>> > >>>>> > > codebase. I'm not saying we use "apple" in the test data.
>> See
>> >>> > >>>>> > > `isMacOnAppleSilicon` in Utils. Is it unavoidable? No,
>> >>> > >>>>> `isMacOnMSeries` or
>> >>> > >>>>> > > `isMacOnSilicon` is enough.
>> >>> > >>>>> > >
>> >>> > >>>>> > > We really need to draw a line where we disallow vendor
>> names on it
>> >>> > >>>>> - if
>> >>> > >>>>> > > it's the entire codebase, I don't really think it is
>> realistic.
>> >>> > >>>>> > >
>> >>> > >>>>> > > This was really a mistake, and it was definitely not from
>> >>> > >>>>> referring to the
>> >>> > >>>>> > > existing codebase. Not having a vendor name does not
>> change
>> >>> > >>>>> anything on the
>> >>> > >>>>> > > chance of encountering this issue again. If we really
>> care, we
>> >>> > >>>>> should think
>> >>> > >>>>> > > about style checking, which is the only viable way to
>> catch the
>> >>> > >>>>> mistake.
>> >>> > >>>>> > > Again, I'd argue we have to have a bunch of vendor names
>> in that
>> >>> > >>>>> style
>> >>> > >>>>> > > check, not just the problematic vendor name.
>> >>> > >>>>> > >
>> >>> > >>>>> > >
>> >>> > >>>>> > > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <
>> [email protected]>
>> >>> > >>>>> wrote:
>> >>> > >>>>> > >
>> >>> > >>>>> > >> Doesn't the migration code 'clear' the debt?
>> >>> > >>>>> > >> The proposal is not to continue to support the config.
>> >>> > >>>>> > >> I feel like people are not quite understanding the
>> change, and
>> >>> > >>>>> objecting
>> >>> > >>>>> > >> to something that doesn't exist.
>> >>> > >>>>> > >> It's a shame, as this seems like something not even worth
>> >>> > >>>>> discussing. I
>> >>> > >>>>> > >> don't know why this triggered this much discussion. We
>> have kept
>> >>> > >>>>> deprecated
>> >>> > >>>>> > >> methods without blinking, which is in comparison much
>> bigger.
>> >>> > >>>>> > >> Can we maybe ask you to review the actual change in
>> question?
>> >>> > >>>>> > >>
>> >>> > >>>>> > >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <
>> [email protected]>
>> >>> > >>>>> wrote:
>> >>> > >>>>> > >>
>> >>> > >>>>> > >>> -1
>> >>> > >>>>> > >>> Remove migration logic of incorrect `spark.databricks.*`
>> >>> > >>>>> configuration
>> >>> > >>>>> > >>> in Spark 4.0.0 because I think this configuration was
>> initially
>> >>> > >>>>> introduced
>> >>> > >>>>> > >>> accidentally in Spark 3.5.4, lacking a clear design
>> intent.
>> >>> > >>>>> Although the
>> >>> > >>>>> > >>> immediate maintenance cost of retaining this
>> configuration
>> >>> > >>>>> currently seems
>> >>> > >>>>> > >>> limited, as subsequent versions iterate and user habits
>> form, it
>> >>> > >>>>> may lead
>> >>> > >>>>> > >>> to the continuous accumulation of technical debt. When
>> users
>> >>> > >>>>> come to view
>> >>> > >>>>> > >>> this configuration as one that can be relied on
>> long-term,
>> >>> > >>>>> future removal
>> >>> > >>>>> > >>> may face greater resistance from users and could
>> potentially
>> >>> > >>>>> become an
>> >>> > >>>>> > >>> entrenched and redundant configuration in the codebase.
>> >>> > >>>>> Therefore, promptly
>> >>> > >>>>> > >>> correcting this historically accidental configuration
>> not only
>> >>> > >>>>> maintains
>> >>> > >>>>> > >>> the normativity of the Spark configuration system but
>> also
>> >>> > >>>>> prevents
>> >>> > >>>>> > >>> unintended configurations from becoming de facto
>> standards,
>> >>> > >>>>> thereby
>> >>> > >>>>> > >>> reducing long-term maintenance risks.
>> >>> > >>>>> > >>>
>> >>> > >>>>> > >>> Jie Yang
>> >>> > >>>>> > >>>
>> >>> > >>>>> > >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote:
>> >>> > >>>>> > >>> > -1 because there exists a feasible migration path for
>> Apache
>> >>> > >>>>> Spark
>> >>> > >>>>> > >>> 3.5.4 via Apache Spark 3.5.5.
>> >>> > >>>>> > >>> >
>> >>> > >>>>> > >>> > It's obvious that this Databricks' mistake already
>> causes a
>> >>> > >>>>> huge
>> >>> > >>>>> > >>> communication cost in the Apache Spark community and is
>> >>> > >>>>> suggesting a burden
>> >>> > >>>>> > >>> to enforce us to handle at least two more PRs at 4.0.0
>> and 4.1.0.
>> >>> > >>>>> > >>> >
>> >>> > >>>>> > >>> > Given that, I don't think
>> >>> > >>>>> > >>> > - This is an inevitable or
>> >>> > >>>>> > >>> > - This is 0 cost
>> >>> > >>>>> > >>> >
>> >>> > >>>>> > >>> > Dongjoon.
>> >>> > >>>>> > >>> >
>> >>> > >>>>> > >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote:
>> >>> > >>>>> > >>> > > Starting from my +1 (non-binding).
>> >>> > >>>>> > >>> > >
>> >>> > >>>>> > >>> > > In addition, I propose to retain migration logic
>> till Spark
>> >>> > >>>>> 4.1.x and
>> >>> > >>>>> > >>> > > remove it in Spark 4.2.0.
>> >>> > >>>>> > >>> > >
>> >>> > >>>>> > >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim <
>> >>> > >>>>> > >>> [email protected]>
>> >>> > >>>>> > >>> > > wrote:
>> >>> > >>>>> > >>> > >
>> >>> > >>>>> > >>> > > > Hi dev,
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>> > >>> > > > Please vote to retain migration logic of incorrect
>> >>> > >>>>> > >>> `spark.databricks.*`
>> >>> > >>>>> > >>> > > > configuration in Spark 4.0.x.
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>> > >>> > > > - DISCUSSION:
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>>
>> https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr
>> >>> > >>>>> > >>> > > > ([DISCUSS] Handling spark.databricks.* config
>> being
>> >>> > >>>>> exposed in
>> >>> > >>>>> > >>> 3.5.4 in
>> >>> > >>>>> > >>> > > > Spark 4.0.0+)
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>> > >>> > > > Specifically, please review this post
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>>
>> https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k
>> >>> > >>>>> > >>> which
>> >>> > >>>>> > >>> > > > explains pros and cons about the proposal -
>> proposal is
>> >>> > >>>>> about
>> >>> > >>>>> > >>> "Option 1".
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>> > >>> > > > Simply speaking, this vote is to allow streaming
>> queries
>> >>> > >>>>> which had
>> >>> > >>>>> > >>> been
>> >>> > >>>>> > >>> > > > ever run in Spark 3.5.4 to be upgraded with Spark
>> 4.0.x,
>> >>> > >>>>> "without
>> >>> > >>>>> > >>> having to
>> >>> > >>>>> > >>> > > > be upgraded with Spark 3.5.5+ in prior". If the
>> vote
>> >>> > >>>>> passes, we
>> >>> > >>>>> > >>> will help
>> >>> > >>>>> > >>> > > > users to have a smooth upgrade from Spark 3.5.4
>> to Spark
>> >>> > >>>>> 4.0.x,
>> >>> > >>>>> > >>> which would
>> >>> > >>>>> > >>> > > > be almost 1 year.
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>> > >>> > > > The (only) con of this option is having to
>> retain the
>> >>> > >>>>> incorrect
>> >>> > >>>>> > >>> > > > configuration name as "string" in the codebase a
>> bit
>> >>> > >>>>> longer. The
>> >>> > >>>>> > >>> code
>> >>> > >>>>> > >>> > > > complexity of migration logic is arguably
>> trivial. (link
>> >>> > >>>>> > >>> > > > <
>> >>> > >>>>> > >>>
>> >>> > >>>>>
>> https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183
>> >>> > >>>>> > >>> >
>> >>> > >>>>> > >>> > > > )
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>> > >>> > > > This VOTE is for Spark 4.0.x, but if someone
>> supports
>> >>> > >>>>> including
>> >>> > >>>>> > >>> migration
>> >>> > >>>>> > >>> > > > logic to be longer than Spark 4.0.x, please cast
>> +1 here
>> >>> > >>>>> and leave
>> >>> > >>>>> > >>> the
>> >>> > >>>>> > >>> > > > desired last minor version of Spark to retain this
>> >>> > >>>>> migration logic.
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>> > >>> > > > The vote is open for the next 72 hours and passes
>> if a
>> >>> > >>>>> majority +1
>> >>> > >>>>> > >>> PMC
>> >>> > >>>>> > >>> > > > votes are cast, with a minimum of 3 +1 votes.
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>> > >>> > > > [ ] +1 Retain migration logic of incorrect
>> >>> > >>>>> `spark.databricks.*`
>> >>> > >>>>> > >>> > > > configuration in Spark 4.0.x
>> >>> > >>>>> > >>> > > > [ ] -1 Remove migration logic of incorrect
>> >>> > >>>>> `spark.databricks.*`
>> >>> > >>>>> > >>> > > > configuration in Spark 4.0.0 because...
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>> > >>> > > > Thanks!
>> >>> > >>>>> > >>> > > > Jungtaek Lim (HeartSaVioR)
>> >>> > >>>>> > >>> > > >
>> >>> > >>>>> > >>> > >
>> >>> > >>>>> > >>> >
>> >>> > >>>>> > >>> >
>> >>> > >>>>>
>> ---------------------------------------------------------------------
>> >>> > >>>>> > >>> > To unsubscribe e-mail:
>> [email protected]
>> >>> > >>>>> > >>> >
>> >>> > >>>>> > >>> >
>> >>> > >>>>> > >>>
>> >>> > >>>>> > >>>
>> >>> > >>>>> > >>>
>> >>> > >>>>> > >>>
>> >>> > >>>>> >
>> >>> > >>>>>
>> >>> > >>>>>
>> >>> > >>>>>
>> >>> > >>>>>
>> >>> > >>>>
>> >>> > >>>> --
>> >>> > >>>> Adam Binford
>> >>> > >>>>
>> >>> > >>>
>> >>> >
>> >>>
>> >>>
>>
>>
>>
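
For readers unfamiliar with what "migration logic" means in this thread, here is a minimal sketch of the general idea, not the actual Spark code (that is the Scala snippet in OffsetSeq.scala linked in the original proposal). It assumes a checkpoint records configuration as a plain key/value map, and both key names below are hypothetical placeholders: the thread only identifies the prefix `spark.databricks.*`. The sketch carries a value recorded under the incorrect key over to the correct key when a stream is restarted, so the old key is only ever read from recorded metadata and never offered as a user-facing setting.

```python
# Hypothetical key names for illustration only. The thread does not spell
# out the full configuration names, just the `spark.databricks.*` prefix.
INCORRECT_KEY = "spark.databricks.example.someFlag"
CORRECT_KEY = "spark.sql.example.someFlag"


def migrate_checkpoint_conf(recorded_conf):
    """Return a copy of the conf recorded in a checkpoint, with any value
    stored under the incorrect key carried over to the correct key."""
    migrated = dict(recorded_conf)  # never mutate the caller's map
    # Act only when the checkpoint has the incorrect key and no value has
    # been recorded under the correct key yet.
    if INCORRECT_KEY in migrated and CORRECT_KEY not in migrated:
        migrated[CORRECT_KEY] = migrated[INCORRECT_KEY]
    return migrated


if __name__ == "__main__":
    old = {INCORRECT_KEY: "false", "spark.sql.shuffle.partitions": "200"}
    print(migrate_checkpoint_conf(old))
```

In this sketch a value already recorded under the correct key wins, and checkpoints that never contained the incorrect key pass through unchanged. That is a design choice of the sketch, not a statement about how Spark resolves the two keys.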
