Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Sean Owen Thu, 13 Mar 2025 07:38:53 -0700

I'm not sure if a VOTE is appropriate here, but I also do not see any valid
technical objection here. I don't think this can be considered a valid
'veto' even if we were thinking of it that way.
I think there are other non-technical factors influencing this position. I
believe we should proceed with Jungtaek's proposal.


On Thu, Mar 13, 2025 at 9:17 AM Dongjoon Hyun <[email protected]>
wrote:

> We are having this vote to give clarity by keeping a full record of the
> community's decisions and stances while building community consensus. All
> votes are important and counted.
>
> To Jungtaek, I already cast my veto properly and have been tracking the
> thread. You don't need to ask me to revisit it because I've been here.
>
> To Xiao, in the history of Apache Spark, have we ever made the mistake of
> shipping a vendor-owned configuration like `spark.databricks.*`? I believe
> you are swapping the real root cause and its bad consequence here.
> > In the history of Apache Spark, have we ever required users to upgrade
> > to the next maintenance release before moving to a new feature or major
> > release?
>
> Thanks,
> Dongjoon.
>
>
> On Thu, Mar 13, 2025 at 12:58 AM Jungtaek Lim <
> [email protected]> wrote:
>
>> Thanks to everyone who participated and voted!
>>
>> Now I can technically conclude the VOTE, but I'm willing to wait till US
>> daytime tomorrow, to give some time for Dongjoon to revisit this.
>>
>> I'll conclude the vote around 6PM PST tomorrow regardless of his vote.
>> It's ideal to see us have no -1, but having one -1 doesn't block this
>> vote and we can move forward.
>>
>> On Thu, Mar 13, 2025 at 4:42 PM Yang Jie <[email protected]> wrote:
>>
>>> forgot to mention in my last reply, my stance is +1
>>>
>>> Jie Yang
>>>
>>> On 2025/03/13 07:08:12 Russell Jurney wrote:
>>> > Sure, +1 non-binding.
>>> >
>>> > On Wed, Mar 12, 2025 at 11:18 PM Jungtaek Lim <
>>> [email protected]>
>>> > wrote:
>>> >
>>> > > Russell,
>>> > >
>>> > > Of course, we also hear the voices of people who don't have binding
>>> > > votes. Personally I think they are more important than committers/PMC
>>> > > members' votes this time, since we can be biased and far from the user
>>> > > experience.
>>> > >
>>> > > Could you please explicitly cast your vote, like +1 (non-binding)? You
>>> > > seem to agree with the proposal. Thanks!
>>> > >
>>> > > On Thu, Mar 13, 2025 at 3:15 PM Russell Jurney <
>>> [email protected]>
>>> > > wrote:
>>> > >
>>> > >> I'm just a lurker and aspiring contributor, but as a Spark user
>>> > >> upgrading twice is very confusing and would cause many or most users
>>> > >> to fail to upgrade successfully to Spark 4 on a first go. That seems
>>> > >> like a very bad user experience. I thought it was worthwhile stating
>>> > >> this out loud.
>>> > >>
>>> > >> Russell
>>> > >>
>>> > >> On Wed, Mar 12, 2025 at 11:05 PM Xiao Li <[email protected]>
>>> wrote:
>>> > >>
>>> > >>>> this vote is to allow streaming queries which had been ever run in
>>> > >>>> Spark 3.5.4 to be upgraded with Spark 4.0.x, "without having to be
>>> > >>>> upgraded with Spark 3.5.5+ in prior".
>>> > >>>
>>> > >>>
>>> > >>> In the history of Apache Spark, have we ever required users to
>>> > >>> upgrade to the next maintenance release before moving to a new
>>> > >>> feature or major release?
>>> > >>>
>>> > >>> Xiao
>>> > >>>
>>> > >>> Adam Binford <[email protected]> wrote on Tue, Mar 11, 2025 at 9:08 AM:
>>> > >>>
>>> > >>>> +1 (non-binding)
>>> > >>>>
>>> > >>>> It's a pretty in-the-weeds issue with how Structured Streaming works
>>> > >>>> under the hood that's kind of hard to understand if you're not
>>> > >>>> familiar with it. The migration logic doesn't mean users can still
>>> > >>>> use the old config; it's purely behind the scenes, to fix checkpoint
>>> > >>>> metadata in streams created in 3.5.4. The 5 lines of code it takes
>>> > >>>> to address a weird edge case for certain users that's already gone
>>> > >>>> from master shouldn't be a huge deal.
>>> > >>>>
>>> > >>>> On Tue, Mar 11, 2025 at 1:43 AM Yang Jie <[email protected]>
>>> wrote:
>>> > >>>>
>>> > >>>>>
>>> > >>>>> To Sean, you're right, I'm very sorry.
>>> > >>>>>
>>> > >>>>> From the perspective of compatibility and migratability, I think
>>> we
>>> > >>>>> should migrate this logic to 4.0.0 and keep it in the codebase
>>> for a longer
>>> > >>>>> time (or permanently), because we can't predict which version
>>> users of
>>> > >>>>> 3.5.4 will choose next.
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> I don't want to discuss the so-called vendor issue.
>>> > >>>>>
>>> > >>>>> I withdraw my previous -1.
>>> > >>>>>
>>> > >>>>> Jie Yang.
>>> > >>>>>
>>> > >>>>> On 2025/03/11 04:42:25 Wenchen Fan wrote:
>>> > >>>>> > Guys, let’s be honest about what we’re discussing here.
>>> > >>>>> >
>>> > >>>>> > If this is a migration issue, why would we even need a vote?
>>> We’ve
>>> > >>>>> been
>>> > >>>>> > consistently adding configurations to restore legacy behavior
>>> > >>>>> instead of
>>> > >>>>> > removing them because we understand the challenges of
>>> upgrading Spark
>>> > >>>>> > versions. Our goal has always been to make upgrades easier,
>>> even if
>>> > >>>>> it
>>> > >>>>> > means carrying some technical debt. I don’t think we want to
>>> change
>>> > >>>>> that
>>> > >>>>> > culture now.
>>> > >>>>> >
>>> > >>>>> > If the concern is about vendor names appearing in the
>>> codebase, then
>>> > >>>>> why is
>>> > >>>>> > it a big deal this time when vendor names are already present
>>> > >>>>> elsewhere? If
>>> > >>>>> > we’ve failed to follow a policy, let’s correct it, but can
>>> someone
>>> > >>>>> point to
>>> > >>>>> > the specific policy we’re violating?
>>> > >>>>> >
>>> > >>>>> > If the vote is about adding migration logic to ease the
>>> upgrade from
>>> > >>>>> 3.5.4
>>> > >>>>> > to 4.0.0, then +1, why not?
>>> > >>>>> >
>>> > >>>>> > Thanks,
>>> > >>>>> > Wenchen
>>> > >>>>> >
>>> > >>>>> >
>>> > >>>>> >
>>> > >>>>> > On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim <
>>> > >>>>> [email protected]>
>>> > >>>>> > wrote:
>>> > >>>>> >
>>> > >>>>> > > Well said, Sean. Sorry I made you keep around here since it
>>> might
>>> > >>>>> not be
>>> > >>>>> > > clearly stated. My bad.
>>> > >>>>> > >
>>> > >>>>> > > Yang, how could we ever tolerate the fact there are "other"
>>> > >>>>> occurrences of
>>> > >>>>> > > vendor names in the codebase? Please go and search
>>> "databricks" in
>>> > >>>>> the
>>> > >>>>> > > codebase and be surprised.
>>> > >>>>> > >
>>> > >>>>> > > If we believe that having vendor names in the codebase will
>>> > >>>>> increase
>>> > >>>>> > > the occurrence of making mistakes, why didn't we have a
>>> discussion
>>> > >>>>> thread
>>> > >>>>> > > earlier to remove all occurrences altogether? This is super
>>> tricky
>>> > >>>>> because
>>> > >>>>> > > I can even start to argue we have "Apple" as a vendor name in
>>> > >>>>> Apache Spark
>>> > >>>>> > > codebase. I'm not saying we use "apple" in the test data. See
>>> > >>>>> > > `isMacOnAppleSilicon` in Utils. Is it unavoidable? No,
>>> > >>>>> `isMacOnMSeries` or
>>> > >>>>> > > `isMacOnSilicon` is enough.
>>> > >>>>> > >
>>> > >>>>> > > We really need to draw a line where we disallow vendor names
>>> on it
>>> > >>>>> - if
>>> > >>>>> > > it's the entire codebase, I don't really think it is
>>> realistic.
>>> > >>>>> > >
>>> > >>>>> > > This was really a mistake, and it was definitely not from
>>> > >>>>> referring to the
>>> > >>>>> > > existing codebase. Not having a vendor name does not change
>>> > >>>>> anything on the
>>> > >>>>> > > chance of encountering this issue again. If we really care,
>>> we
>>> > >>>>> should think
>>> > >>>>> > > about style checking, which is the only viable way to catch
>>> the
>>> > >>>>> mistake.
>>> > >>>>> > > Again, I'd argue we have to have a bunch of vendor names in
>>> that
>>> > >>>>> style
>>> > >>>>> > > check, not just the problematic vendor name.
>>> > >>>>> > >
>>> > >>>>> > >
>>> > >>>>> > > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <[email protected]>
>>> > >>>>> > > wrote:
>>> > >>>>> > >
>>> > >>>>> > >> Doesn't the migration code 'clear' the debt?
>>> > >>>>> > >> The proposal is not to continue to support the config.
>>> > >>>>> > >> I feel like people are not quite understanding the change,
>>> and
>>> > >>>>> objecting
>>> > >>>>> > >> to something that doesn't exist.
>>> > >>>>> > >> It's a shame, as this seems like something not even worth
>>> > >>>>> discussing. I
>>> > >>>>> > >> don't know why this triggered this much discussion. We have
>>> kept
>>> > >>>>> deprecated
>>> > >>>>> > >> methods without blinking, which is in comparison much
>>> bigger.
>>> > >>>>> > >> Can we maybe ask you to review the actual change in question?
>>> > >>>>> > >>
>>> > >>>>> > >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <
>>> [email protected]>
>>> > >>>>> wrote:
>>> > >>>>> > >>
>>> > >>>>> > >>> -1
>>> > >>>>> > >>> Remove migration logic of incorrect `spark.databricks.*`
>>> > >>>>> configuration
>>> > >>>>> > >>> in Spark 4.0.0 because I think this configuration was
>>> initially
>>> > >>>>> introduced
>>> > >>>>> > >>> accidentally in Spark 3.5.4, lacking a clear design intent.
>>> > >>>>> Although the
>>> > >>>>> > >>> immediate maintenance cost of retaining this configuration
>>> > >>>>> currently seems
>>> > >>>>> > >>> limited, as subsequent versions iterate and user habits
>>> form, it
>>> > >>>>> may lead
>>> > >>>>> > >>> to the continuous accumulation of technical debt. When
>>> users
>>> > >>>>> come to view
>>> > >>>>> > >>> this configuration as one that can be relied on long-term,
>>> > >>>>> future removal
>>> > >>>>> > >>> may face greater resistance from users and could
>>> potentially
>>> > >>>>> become an
>>> > >>>>> > >>> entrenched and redundant configuration in the codebase.
>>> > >>>>> Therefore, promptly
>>> > >>>>> > >>> correcting this historically accidental configuration not
>>> only
>>> > >>>>> maintains
>>> > >>>>> > >>> the normativity of the Spark configuration system but also
>>> > >>>>> prevents
>>> > >>>>> > >>> unintended configurations from becoming de facto standards,
>>> > >>>>> thereby
>>> > >>>>> > >>> reducing long-term maintenance risks.
>>> > >>>>> > >>>
>>> > >>>>> > >>> Jie Yang
>>> > >>>>> > >>>
>>> > >>>>> > >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote:
>>> > >>>>> > >>> > -1 because there exists a feasible migration path for Apache
>>> > >>>>> > >>> > Spark 3.5.4 via Apache Spark 3.5.5.
>>> > >>>>> > >>> >
>>> > >>>>> > >>> > It's obvious that this Databricks mistake has already caused
>>> > >>>>> > >>> > a huge communication cost in the Apache Spark community and
>>> > >>>>> > >>> > would burden us with handling at least two more PRs, at 4.0.0
>>> > >>>>> > >>> > and 4.1.0.
>>> > >>>>> > >>> >
>>> > >>>>> > >>> > Given that, I don't think
>>> > >>>>> > >>> > - This is inevitable, or
>>> > >>>>> > >>> > - This is zero cost
>>> > >>>>> > >>> >
>>> > >>>>> > >>> > Dongjoon.
>>> > >>>>> > >>> >
>>> > >>>>> > >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote:
>>> > >>>>> > >>> > > Starting from my +1 (non-binding).
>>> > >>>>> > >>> > >
>>> > >>>>> > >>> > > In addition, I propose to retain migration logic till
>>> Spark
>>> > >>>>> 4.1.x and
>>> > >>>>> > >>> > > remove it in Spark 4.2.0.
>>> > >>>>> > >>> > >
>>> > >>>>> > >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim <
>>> > >>>>> > >>> [email protected]>
>>> > >>>>> > >>> > > wrote:
>>> > >>>>> > >>> > >
>>> > >>>>> > >>> > > > Hi dev,
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > Please vote to retain migration logic of incorrect
>>> > >>>>> > >>> `spark.databricks.*`
>>> > >>>>> > >>> > > > configuration in Spark 4.0.x.
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > - DISCUSSION:
>>> > >>>>> > >>> > > > https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr
>>> > >>>>> > >>> > > > ([DISCUSS] Handling spark.databricks.* config being exposed
>>> > >>>>> > >>> > > > in 3.5.4 in Spark 4.0.0+)
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > Specifically, please review this post
>>> > >>>>> > >>> > > > https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k
>>> > >>>>> > >>> > > > which explains pros and cons about the proposal - proposal
>>> > >>>>> > >>> > > > is about "Option 1".
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > Simply speaking, this vote is to allow streaming queries
>>> > >>>>> > >>> > > > which had been ever run in Spark 3.5.4 to be upgraded with
>>> > >>>>> > >>> > > > Spark 4.0.x, "without having to be upgraded with Spark
>>> > >>>>> > >>> > > > 3.5.5+ in prior". If the vote passes, we will help users to
>>> > >>>>> > >>> > > > have a smooth upgrade from Spark 3.5.4 to Spark 4.0.x,
>>> > >>>>> > >>> > > > which would be almost 1 year.
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > The (only) con of this option is having to retain the
>>> > >>>>> > >>> > > > incorrect configuration name as a "string" in the codebase
>>> > >>>>> > >>> > > > a bit longer. The code complexity of the migration logic is
>>> > >>>>> > >>> > > > arguably trivial. (link
>>> > >>>>> > >>> > > > <https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183>)
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > This VOTE is for Spark 4.0.x, but if someone supports
>>> > >>>>> including
>>> > >>>>> > >>> migration
>>> > >>>>> > >>> > > > logic to be longer than Spark 4.0.x, please cast +1
>>> here
>>> > >>>>> and leave
>>> > >>>>> > >>> the
>>> > >>>>> > >>> > > > desired last minor version of Spark to retain this
>>> > >>>>> migration logic.
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > The vote is open for the next 72 hours and passes if
>>> a
>>> > >>>>> majority +1
>>> > >>>>> > >>> PMC
>>> > >>>>> > >>> > > > votes are cast, with a minimum of 3 +1 votes.
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > [ ] +1 Retain migration logic of incorrect
>>> > >>>>> `spark.databricks.*`
>>> > >>>>> > >>> > > > configuration in Spark 4.0.x
>>> > >>>>> > >>> > > > [ ] -1 Remove migration logic of incorrect
>>> > >>>>> `spark.databricks.*`
>>> > >>>>> > >>> > > > configuration in Spark 4.0.0 because...
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > Thanks!
>>> > >>>>> > >>> > > > Jungtaek Lim (HeartSaVioR)
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > >
>>> > >>>>> > >>> >
>>> > >>>>> > >>> >
>>> > >>>>>
>>> ---------------------------------------------------------------------
>>> > >>>>> > >>> > To unsubscribe e-mail: [email protected]
>>> > >>>>> > >>> >
>>> > >>>>> > >>> >
>>> > >>>>> > >>>
>>> > >>>>> > >>>
>>> > >>>>> > >>>
>>> > >>>>> > >>>
>>> > >>>>> >
>>> > >>>>>
>>> > >>>>>
>>> > >>>>>
>>> > >>>>
>>> > >>>> --
>>> > >>>> Adam Binford
>>> > >>>>
>>> > >>>
>>> >
>>>
>>>
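The migration logic under vote is described in the thread as a few lines that fix checkpoint metadata written by Spark 3.5.4, without letting users keep using the old config. A minimal Python sketch of that idea is below. It is not the Spark code: the two key names and the function name are hypothetical placeholders, since the thread never spells out the full configuration name, and the real logic lives in the linked OffsetSeq.scala.

```python
from typing import Dict

# Hypothetical key names, for illustration only (assumption: the real
# incorrect key sits under `spark.databricks.*` and has a correctly named
# counterpart).
INCORRECT_KEY = "spark.databricks.example.someFlag"
CORRECT_KEY = "spark.example.someFlag"


def migrate_checkpoint_conf(conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of stored checkpoint conf with the incorrect key migrated.

    If only the incorrect key is present, its value is carried over to the
    correct key. If the correct key is already present, it is kept as-is.
    The incorrect key is dropped from the result in both cases.
    """
    migrated = dict(conf)  # never mutate the caller's metadata
    old_value = migrated.pop(INCORRECT_KEY, None)
    if old_value is not None and CORRECT_KEY not in migrated:
        migrated[CORRECT_KEY] = old_value
    return migrated
```

In this sketch the old key is only read from stored metadata and never written back, which mirrors the point made in the thread that the proposal is not to continue supporting the config.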
