Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Xiao Li Wed, 12 Mar 2025 23:05:25 -0700

>
> this vote is to allow streaming queries which had been ever run in Spark
> 3.5.4 to be upgraded with Spark 4.0.x, "without having to be upgraded with
> Spark 3.5.5+ in prior".



In the history of Apache Spark, have we ever required users to upgrade to
the next maintenance release before moving to a new feature or major
release?

Xiao

On Tue, Mar 11, 2025 at 09:08, Adam Binford <[email protected]> wrote:

> +1 (non-binding)
>
> It's a pretty in the weeds issue with how Structured Streaming works under
> the hood that's kinda hard to understand if you're not familiar with it.
> The migration logic doesn't mean users can still use the old config, it's
> purely behind the scenes to fix checkpoint metadata in streams created in
> 3.5.4. The 5 lines of code it takes to address a weird edge case for
> certain users that's already gone from master shouldn't be a huge deal.
>
> On Tue, Mar 11, 2025 at 1:43 AM Yang Jie <[email protected]> wrote:
>
>>
>> To Sean, you're right, I'm very sorry.
>>
>> From the perspective of compatibility and migratability, I think we
>> should migrate this logic to 4.0.0 and keep it in the codebase for a longer
>> time (or permanently), because we can't predict which version users of
>> 3.5.4 will choose next.
>>
>>
>> I don't want to discuss the so-called vendor issue.
>>
>> I withdraw my previous -1.
>>
>> Jie Yang.
>>
>> On 2025/03/11 04:42:25 Wenchen Fan wrote:
>> > Guys, let’s be honest about what we’re discussing here.
>> >
>> > If this is a migration issue, why would we even need a vote? We’ve been
>> > consistently adding configurations to restore legacy behavior instead of
>> > removing them because we understand the challenges of upgrading Spark
>> > versions. Our goal has always been to make upgrades easier, even if it
>> > means carrying some technical debt. I don’t think we want to change that
>> > culture now.
>> >
>> > If the concern is about vendor names appearing in the codebase, then why is
>> > it a big deal this time when vendor names are already present elsewhere? If
>> > we’ve failed to follow a policy, let’s correct it, but can someone point to
>> > the specific policy we’re violating?
>> >
>> > If the vote is about adding migration logic to ease the upgrade from 3.5.4
>> > to 4.0.0, then +1, why not?
>> >
>> > Thanks,
>> > Wenchen
>> >
>> >
>> >
>> > On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim <[email protected]>
>> > wrote:
>> >
>> > > Well said, Sean. Sorry I made you keep around here since it might not be
>> > > clearly stated. My bad.
>> > >
>> > > Yang, how could we ever tolerate the fact there are "other" occurrences of
>> > > vendor names in the codebase? Please go and search "databricks" in the
>> > > codebase and be surprised.
>> > >
>> > > If we believe that having vendor names in the codebase will increase
>> > > the occurrence of making mistakes, why didn't we have a discussion thread
>> > > earlier to remove all occurrences altogether? This is super tricky because
>> > > I can even start to argue we have "Apple" as a vendor name in Apache Spark
>> > > codebase. I'm not saying we use "apple" in the test data. See
>> > > `isMacOnAppleSilicon` in Utils. Is it unavoidable? No, `isMacOnMSeries` or
>> > > `isMacOnSilicon` is enough.
>> > >
>> > > We really need to draw a line where we disallow vendor names on it - if
>> > > it's the entire codebase, I don't really think it is realistic.
>> > >
>> > > This was really a mistake, and it was definitely not from referring to the
>> > > existing codebase. Not having a vendor name does not change anything on the
>> > > chance of encountering this issue again. If we really care, we should think
>> > > about style checking, which is the only viable way to catch the mistake.
>> > > Again, I'd argue we have to have a bunch of vendor names in that style
>> > > check, not just the problematic vendor name.
>> > >
>> > >
>> > > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <[email protected]> wrote:
>> > >
>> > >> Doesn't the migration code 'clear' the debt?
>> > >> The proposal is not to continue to support the config.
>> > >> I feel like people are not quite understanding the change, and objecting
>> > >> to something that doesn't exist.
>> > >> It's a shame, as this seems like something not even worth discussing. I
>> > >> don't know why this triggered this much discussion. We have kept deprecated
>> > >> methods without blinking, which is in comparison much bigger.
>> > >> Can we maybe ask you to review the actual change in question?
>> > >>
>> > >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <[email protected]> wrote:
>> > >>
>> > >>> -1
>> > >>> Remove migration logic of incorrect `spark.databricks.*` configuration
>> > >>> in Spark 4.0.0 because I think this configuration was initially introduced
>> > >>> accidentally in Spark 3.5.4, lacking a clear design intent. Although the
>> > >>> immediate maintenance cost of retaining this configuration currently seems
>> > >>> limited, as subsequent versions iterate and user habits form, it may lead
>> > >>> to the continuous accumulation of technical debt. When users come to view
>> > >>> this configuration as one that can be relied on long-term, future removal
>> > >>> may face greater resistance from users and could potentially become an
>> > >>> entrenched and redundant configuration in the codebase. Therefore, promptly
>> > >>> correcting this historically accidental configuration not only maintains
>> > >>> the normativity of the Spark configuration system but also prevents
>> > >>> unintended configurations from becoming de facto standards, thereby
>> > >>> reducing long-term maintenance risks.
>> > >>>
>> > >>> Jie Yang
>> > >>>
>> > >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote:
>> > >>> > -1 because there exists a feasible migration path for Apache Spark
>> > >>> > 3.5.4 via Apache Spark 3.5.5.
>> > >>> >
>> > >>> > It's obvious that this Databricks' mistake already causes a huge
>> > >>> > communication cost in the Apache Spark community and is suggesting a burden
>> > >>> > to enforce us to handle at least two more PRs at 4.0.0 and 4.1.0.
>> > >>> >
>> > >>> > Given that, I don't think
>> > >>> > - This is an inevitable or
>> > >>> > - This is 0 cost
>> > >>> >
>> > >>> > Dongjoon.
>> > >>> >
>> > >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote:
>> > >>> > > Starting from my +1 (non-binding).
>> > >>> > >
>> > >>> > > In addition, I propose to retain migration logic till Spark 4.1.x and
>> > >>> > > remove it in Spark 4.2.0.
>> > >>> > >
>> > >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim <[email protected]>
>> > >>> > > wrote:
>> > >>> > >
>> > >>> > > > Hi dev,
>> > >>> > > >
>> > >>> > > > Please vote to retain migration logic of incorrect `spark.databricks.*`
>> > >>> > > > configuration in Spark 4.0.x.
>> > >>> > > >
>> > >>> > > > - DISCUSSION:
>> > >>> > > > https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr
>> > >>> > > > ([DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in
>> > >>> > > > Spark 4.0.0+)
>> > >>> > > >
>> > >>> > > > Specifically, please review this post
>> > >>> > > > https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k which
>> > >>> > > > explains pros and cons about the proposal - proposal is about "Option 1".
>> > >>> > > >
>> > >>> > > > Simply speaking, this vote is to allow streaming queries which had been
>> > >>> > > > ever run in Spark 3.5.4 to be upgraded with Spark 4.0.x, "without having to
>> > >>> > > > be upgraded with Spark 3.5.5+ in prior". If the vote passes, we will help
>> > >>> > > > users to have a smooth upgrade from Spark 3.5.4 to Spark 4.0.x, which would
>> > >>> > > > be almost 1 year.
>> > >>> > > >
>> > >>> > > > The (only) cons in this option is having to retain the incorrect
>> > >>> > > > configuration name as "string" in the codebase a bit longer. The code
>> > >>> > > > complexity of migration logic is arguably trivial. (link
>> > >>> > > > <https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183>)
>> > >>> > > >
>> > >>> > > > This VOTE is for Spark 4.0.x, but if someone supports including migration
>> > >>> > > > logic to be longer than Spark 4.0.x, please cast +1 here and leave the
>> > >>> > > > desired last minor version of Spark to retain this migration logic.
>> > >>> > > >
>> > >>> > > > The vote is open for the next 72 hours and passes if a majority +1 PMC
>> > >>> > > > votes are cast, with a minimum of 3 +1 votes.
>> > >>> > > >
>> > >>> > > > [ ] +1 Retain migration logic of incorrect `spark.databricks.*`
>> > >>> > > > configuration in Spark 4.0.x
>> > >>> > > > [ ] -1 Remove migration logic of incorrect `spark.databricks.*`
>> > >>> > > > configuration in Spark 4.0.0 because...
>> > >>> > > >
>> > >>> > > > Thanks!
>> > >>> > > > Jungtaek Lim (HeartSaVioR)
>> > >>> > > >
>> > >>> > >
>> > >>> >
>> > >>> >
>> ---------------------------------------------------------------------
>> > >>> > To unsubscribe e-mail: [email protected]
>> > >>> >
>> > >>> >
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> >
>>
>>
>>
>
> --
> Adam Binford
>
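
For readers unfamiliar with the mechanism discussed in this thread, the idea of "migration logic that fixes checkpoint metadata" can be pictured with a minimal sketch. This is not the Spark code (the actual change is the Scala snippet linked in the vote message); it is a Python illustration under stated assumptions. The two key names and the function `migrate_checkpoint_conf` are hypothetical, and so is the choice to leave an already recorded correct key untouched.

```python
from typing import Dict

# Hypothetical key names, for illustration only; the thread only says
# the incorrect key lives under `spark.databricks.*`.
LEGACY_KEY = "spark.databricks.example.someFlag"
CORRECT_KEY = "spark.sql.example.someFlag"


def migrate_checkpoint_conf(recorded: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the conf recorded in a checkpoint, with the value
    stored under the legacy key carried over to the correct key."""
    migrated = dict(recorded)  # never mutate the recorded metadata
    legacy_value = migrated.pop(LEGACY_KEY, None)
    # Assumption: a value already recorded under the correct key wins.
    if legacy_value is not None and CORRECT_KEY not in migrated:
        migrated[CORRECT_KEY] = legacy_value
    return migrated


if __name__ == "__main__":
    old = {LEGACY_KEY: "false", "spark.sql.shuffle.partitions": "200"}
    print(migrate_checkpoint_conf(old))
```

The point matching the thread: users cannot keep setting the legacy key. It is only read from metadata written by an older run, so that later batches record the correct key.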
