Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Russell Jurney Wed, 12 Mar 2025 23:51:27 -0700

I'm just a lurker and aspiring contributor, but as a Spark user, upgrading
twice is very confusing and would cause many or most users to fail to
upgrade successfully to Spark 4 on the first go. That seems like a very bad
user experience. I thought it was worthwhile stating this out loud.


Russell

On Wed, Mar 12, 2025 at 11:05 PM Xiao Li <[email protected]> wrote:

>> this vote is to allow streaming queries which had been ever run in Spark
>> 3.5.4 to be upgraded with Spark 4.0.x, "without having to be upgraded with
>> Spark 3.5.5+ in prior".
>
>
> In the history of Apache Spark, have we ever required users to upgrade to
> the next maintenance release before moving to a new feature or major
> release?
>
> Xiao
>
> On Tue, Mar 11, 2025 at 9:08 AM Adam Binford <[email protected]> wrote:
>
>> +1 (non-binding)
>>
>> It's a pretty in-the-weeds issue with how Structured Streaming works
>> under the hood that's kinda hard to understand if you're not familiar with
>> it. The migration logic doesn't mean users can still use the old config;
>> it's purely behind the scenes to fix checkpoint metadata in streams created
>> in 3.5.4. The 5 lines of code it takes to address a weird edge case for
>> certain users that's already gone from master shouldn't be a huge deal.
>>
>> On Tue, Mar 11, 2025 at 1:43 AM Yang Jie <[email protected]> wrote:
>>
>>>
>>> To Sean, you're right, I'm very sorry.
>>>
>>> From the perspective of compatibility and migratability, I think we
>>> should migrate this logic to 4.0.0 and keep it in the codebase for a longer
>>> time (or permanently), because we can't predict which version users of
>>> 3.5.4 will choose next.
>>>
>>>
>>> I don't want to discuss the so-called vendor issue.
>>>
>>> I withdraw my previous -1.
>>>
>>> Jie Yang.
>>>
>>> On 2025/03/11 04:42:25 Wenchen Fan wrote:
>>> > Guys, let’s be honest about what we’re discussing here.
>>> >
>>> > If this is a migration issue, why would we even need a vote? We’ve been
>>> > consistently adding configurations to restore legacy behavior instead of
>>> > removing them because we understand the challenges of upgrading Spark
>>> > versions. Our goal has always been to make upgrades easier, even if it
>>> > means carrying some technical debt. I don’t think we want to change that
>>> > culture now.
>>> >
>>> > If the concern is about vendor names appearing in the codebase, then why is
>>> > it a big deal this time when vendor names are already present elsewhere? If
>>> > we’ve failed to follow a policy, let’s correct it, but can someone point to
>>> > the specific policy we’re violating?
>>> >
>>> > If the vote is about adding migration logic to ease the upgrade from 3.5.4
>>> > to 4.0.0, then +1, why not?
>>> >
>>> > Thanks,
>>> > Wenchen
>>> >
>>> >
>>> >
>>> > On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim <[email protected]> wrote:
>>> >
>>> > > Well said, Sean. Sorry I made you stick around here, since it might not
>>> > > have been clearly stated. My bad.
>>> > >
>>> > > Yang, how could we ever tolerate the fact there are "other" occurrences of
>>> > > vendor names in the codebase? Please go and search "databricks" in the
>>> > > codebase and be surprised.
>>> > >
>>> > > If we believe that having vendor names in the codebase will increase
>>> > > the occurrence of making mistakes, why didn't we have a discussion thread
>>> > > earlier to remove all occurrences altogether? This is super tricky because
>>> > > I can even start to argue we have "Apple" as a vendor name in the Apache
>>> > > Spark codebase. I'm not saying we use "apple" in the test data. See
>>> > > `isMacOnAppleSilicon` in Utils. Is it unavoidable? No, `isMacOnMSeries` or
>>> > > `isMacOnSilicon` is enough.
>>> > >
>>> > > We really need to draw a line on where we disallow vendor names - if
>>> > > it's the entire codebase, I don't really think it is realistic.
>>> > >
>>> > > This was really a mistake, and it was definitely not from referring to the
>>> > > existing codebase. Not having a vendor name does not change anything on the
>>> > > chance of encountering this issue again. If we really care, we should think
>>> > > about style checking, which is the only viable way to catch the mistake.
>>> > > Again, I'd argue we have to have a bunch of vendor names in that style
>>> > > check, not just the problematic vendor name.
>>> > >
>>> > >
>>> > > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <[email protected]> wrote:
>>> > >
>>> > >> Doesn't the migration code 'clear' the debt?
>>> > >> The proposal is not to continue to support the config.
>>> > >> I feel like people are not quite understanding the change, and objecting
>>> > >> to something that doesn't exist.
>>> > >> It's a shame, as this seems like something not even worth discussing. I
>>> > >> don't know why this triggered this much discussion. We have kept deprecated
>>> > >> methods without blinking, which is in comparison much bigger.
>>> > >> Can we maybe ask you to review the actual change in question?
>>> > >>
>>> > >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <[email protected]> wrote:
>>> > >>
>>> > >>> -1
>>> > >>> Remove migration logic of incorrect `spark.databricks.*` configuration
>>> > >>> in Spark 4.0.0 because I think this configuration was initially introduced
>>> > >>> accidentally in Spark 3.5.4, lacking a clear design intent. Although the
>>> > >>> immediate maintenance cost of retaining this configuration currently seems
>>> > >>> limited, as subsequent versions iterate and user habits form, it may lead
>>> > >>> to the continuous accumulation of technical debt. When users come to view
>>> > >>> this configuration as one that can be relied on long-term, future removal
>>> > >>> may face greater resistance from users and could potentially become an
>>> > >>> entrenched and redundant configuration in the codebase. Therefore, promptly
>>> > >>> correcting this historically accidental configuration not only maintains
>>> > >>> the normativity of the Spark configuration system but also prevents
>>> > >>> unintended configurations from becoming de facto standards, thereby
>>> > >>> reducing long-term maintenance risks.
>>> > >>>
>>> > >>> Jie Yang
>>> > >>>
>>> > >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote:
>>> > >>> > -1 because there exists a feasible migration path for Apache Spark
>>> > >>> > 3.5.4 via Apache Spark 3.5.5.
>>> > >>> >
>>> > >>> > It's obvious that this mistake by Databricks has already caused a huge
>>> > >>> > communication cost in the Apache Spark community and would burden us
>>> > >>> > with handling at least two more PRs, at 4.0.0 and 4.1.0.
>>> > >>> >
>>> > >>> > Given that, I don't think
>>> > >>> > - This is inevitable, or
>>> > >>> > - This is 0 cost
>>> > >>> >
>>> > >>> > Dongjoon.
>>> > >>> >
>>> > >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote:
>>> > >>> > > Starting from my +1 (non-binding).
>>> > >>> > >
>>> > >>> > > In addition, I propose to retain migration logic till Spark 4.1.x and
>>> > >>> > > remove it in Spark 4.2.0.
>>> > >>> > >
>>> > >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim <[email protected]>
>>> > >>> > > wrote:
>>> > >>> > >
>>> > >>> > > > Hi dev,
>>> > >>> > > >
>>> > >>> > > > Please vote to retain migration logic of incorrect `spark.databricks.*`
>>> > >>> > > > configuration in Spark 4.0.x.
>>> > >>> > > >
>>> > >>> > > > - DISCUSSION:
>>> > >>> > > > https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr
>>> > >>> > > > ([DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in
>>> > >>> > > > Spark 4.0.0+)
>>> > >>> > > >
>>> > >>> > > > Specifically, please review this post
>>> > >>> > > > https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k which
>>> > >>> > > > explains pros and cons about the proposal - proposal is about "Option 1".
>>> > >>> > > >
>>> > >>> > > > Simply speaking, this vote is to allow streaming queries which had been
>>> > >>> > > > ever run in Spark 3.5.4 to be upgraded with Spark 4.0.x, "without having to
>>> > >>> > > > be upgraded with Spark 3.5.5+ in prior". If the vote passes, we will help
>>> > >>> > > > users to have a smooth upgrade from Spark 3.5.4 to Spark 4.0.x, which would
>>> > >>> > > > be almost 1 year.
>>> > >>> > > >
>>> > >>> > > > The (only) con of this option is having to retain the incorrect
>>> > >>> > > > configuration name as a "string" in the codebase a bit longer. The code
>>> > >>> > > > complexity of the migration logic is arguably trivial. (link
>>> > >>> > > > <https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183>)
>>> > >>> > > >
>>> > >>> > > > This VOTE is for Spark 4.0.x, but if someone supports retaining the
>>> > >>> > > > migration logic longer than Spark 4.0.x, please cast +1 here and state the
>>> > >>> > > > desired last minor version of Spark to retain this migration logic.
>>> > >>> > > >
>>> > >>> > > > The vote is open for the next 72 hours and passes if a majority +1 PMC
>>> > >>> > > > votes are cast, with a minimum of 3 +1 votes.
>>> > >>> > > >
>>> > >>> > > > [ ] +1 Retain migration logic of incorrect `spark.databricks.*`
>>> > >>> > > > configuration in Spark 4.0.x
>>> > >>> > > > [ ] -1 Remove migration logic of incorrect `spark.databricks.*`
>>> > >>> > > > configuration in Spark 4.0.0 because...
>>> > >>> > > >
>>> > >>> > > > Thanks!
>>> > >>> > > > Jungtaek Lim (HeartSaVioR)
>>> > >>> > > >
>>> > >>> > >
>>> > >>> >
>>> > >>> >
>>> ---------------------------------------------------------------------
>>> > >>> > To unsubscribe e-mail: [email protected]
>>> > >>> >
>>> > >>> >
>>> > >>>
>>> > >>>
>>> >
>>>
>>>
>>
>> --
>> Adam Binford
>>
>
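
For readers not familiar with the mechanism discussed in this thread, the following is a minimal sketch, in Python, of the kind of one-time key rename the messages above describe: a configuration entry recorded in checkpoint metadata under a mistaken name is rewritten to the intended name when the metadata is read. It is not the Spark implementation (that is the Scala code in OffsetSeq.scala linked above). The key names `OLD_KEY` and `NEW_KEY`, the function `migrate_checkpoint_conf`, and the rule for resolving a conflict between the two keys are all hypothetical and chosen only for illustration.

```python
from typing import Dict

# Hypothetical placeholder names; NOT the actual Spark configuration keys.
OLD_KEY = "spark.databricks.example.someFlag"  # mistaken name written by the old release
NEW_KEY = "spark.sql.example.someFlag"         # intended name


def migrate_checkpoint_conf(conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the recorded conf with the mistaken key renamed.

    The input mapping is left untouched. If both keys are present, the value
    already stored under the intended key wins (an assumption of this sketch).
    """
    migrated = dict(conf)
    if OLD_KEY in migrated:
        old_value = migrated.pop(OLD_KEY)
        # Only fill in the intended key when it is not already set.
        migrated.setdefault(NEW_KEY, old_value)
    return migrated


if __name__ == "__main__":
    recorded = {OLD_KEY: "false", "spark.sql.shuffle.partitions": "200"}
    print(migrate_checkpoint_conf(recorded))
```

In this sketch the old key is only read from previously recorded metadata and never accepted as a live setting, which matches the point made in the thread that the migration logic does not mean users can keep using the old config.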
