Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Adam Binford Tue, 11 Mar 2025 09:08:41 -0700

+1 (non-binding)

It's a pretty in the weeds issue with how Structured Streaming works under
the hood that's kinda hard to understand if you're not familiar with it.
The migration logic doesn't mean users can still use the old config, it's
purely behind the scenes to fix checkpoint metadata in streams created in
3.5.4. The 5 lines of code it takes to address a weird edge case for
certain users that's already gone from master shouldn't be a huge deal.


On Tue, Mar 11, 2025 at 1:43 AM Yang Jie <[email protected]> wrote:

>
> To Sean, you're right, I'm very sorry.
>
> From the perspective of compatibility and migratability, I think we should
> migrate this logic to 4.0.0 and keep it in the codebase for a longer time
> (or permanently), because we can't predict which version users of 3.5.4
> will choose next.
>
>
> I don't want to discuss the so-called vendor issue.
>
> I withdraw my previous -1.
>
> Jie Yang.
>
> On 2025/03/11 04:42:25 Wenchen Fan wrote:
> > Guys, let’s be honest about what we’re discussing here.
> >
> > If this is a migration issue, why would we even need a vote? We’ve been
> > consistently adding configurations to restore legacy behavior instead of
> > removing them because we understand the challenges of upgrading Spark
> > versions. Our goal has always been to make upgrades easier, even if it
> > means carrying some technical debt. I don’t think we want to change that
> > culture now.
> >
> > If the concern is about vendor names appearing in the codebase, then why
> is
> > it a big deal this time when vendor names are already present elsewhere?
> If
> > we’ve failed to follow a policy, let’s correct it, but can someone point
> to
> > the specific policy we’re violating?
> >
> > If the vote is about adding migration logic to ease the upgrade from
> 3.5.4
> > to 4.0.0, then +1, why not?
> >
> > Thanks,
> > Wenchen
> >
> >
> >
> > On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim <
> [email protected]>
> > wrote:
> >
> > > Well said, Sean. Sorry I made you keep around here since it might not
> be
> > > clearly stated. My bad.
> > >
> > > Yang, how could we ever tolerate the fact there are "other"
> occurrences of
> > > vendor names in the codebase? Please go and search "databricks" in the
> > > codebase and be surprised.
> > >
> > > If we believe that having vendor names in the codebase will increase
> > > the occurrence of making mistakes, why didn't we have a discussion
> thread
> > > earlier to remove all occurrences altogether? This is super tricky
> because
> > > I can even start to argue we have "Apple" as a vendor name in Apache
> Spark
> > > codebase. I'm not saying we use "apple" in the test data. See
> > > `isMacOnAppleSilicon` in Utils. Is it unavoidable? No,
> `isMacOnMSeries` or
> > > `isMacOnSilicon` is enough.
> > >
> > > We really need to draw a line where we disallow vendor names on it - if
> > > it's the entire codebase, I don't really think it is realistic.
> > >
> > > This was really a mistake, and it was definitely not from referring to
> the
> > > existing codebase. Not having a vendor name does not change anything
> on the
> > > chance of encountering this issue again. If we really care, we should
> think
> > > about style checking, which is the only viable way to catch the
> mistake.
> > > Again, I'd argue we have to have a bunch of vendor names in that style
> > > check, not just the problematic vendor name.
> > >
> > >
> > > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <[email protected]> wrote:
> > >
> > >> Doesn't the migration code 'clear' the debt?
> > >> The proposal is not to continue to support the config.
> > >> I feel like people are not quite understanding the change, and
> objecting
> > >> to something that doesn't exist.
> > >> It's a shame, as this seems like something not even worth discussing.
> I
> > >> don't know why this triggered this much discussion. We have kept
> deprecated
> > >> methods without blinking, which is in comparison much bigger.
> > >> Can we maybe ask you review the actual change in question?
> > >>
> > >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <[email protected]> wrote:
> > >>
> > >>> -1
> > >>> Remove migration logic of incorrect `spark.databricks.*`
> configuration
> > >>> in Spark 4.0.0 because I think this configuration was initially
> introduced
> > >>> accidentally in Spark 3.5.4, lacking a clear design intent. Although
> the
> > >>> immediate maintenance cost of retaining this configuration currently
> seems
> > >>> limited, as subsequent versions iterate and user habits form, it may
> lead
> > >>> to the continuous accumulation of technical debt. When users come to
> view
> > >>> this configuration as one that can be relied on long-term, future
> removal
> > >>> may face greater resistance from users and could potentially become
> an
> > >>> entrenched and redundant configuration in the codebase. Therefore,
> promptly
> > >>> correcting this historically accidental configuration not only
> maintains
> > >>> the normativity of the Spark configuration system but also prevents
> > >>> unintended configurations from becoming de facto standards, thereby
> > >>> reducing long-term maintenance risks.
> > >>>
> > >>> Jie Yang
> > >>>
> > >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote:
> > >>> > -1 because there exists a feasible migration path for Apache Spark
> > >>> 3.5.4 via Apache Spark 3.5.5.
> > >>> >
> > >>> > It's obvious that this Databricks' mistake already causes a huge
> > >>> communication cost in the Apache Spark community and is suggesting a
> burden
> > >>> to enforce us to handle at least two more PRs at 4.0.0 and 4.1.0.
> > >>> >
> > >>> > Given that, I don't think
> > >>> > - This is an inevitable or
> > >>> > - This is 0 cost
> > >>> >
> > >>> > Dongjoon.
> > >>> >
> > >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote:
> > >>> > > Starting from my +1 (non-binding).
> > >>> > >
> > >>> > > In addition, I propose to retain migration logic till Spark
> 4.1.x and
> > >>> > > remove it in Spark 4.2.0.
> > >>> > >
> > >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim <
> > >>> [email protected]>
> > >>> > > wrote:
> > >>> > >
> > >>> > > > Hi dev,
> > >>> > > >
> > >>> > > > Please vote to retain migration logic of incorrect
> > >>> `spark.databricks.*`
> > >>> > > > configuration in Spark 4.0.x.
> > >>> > > >
> > >>> > > > - DISCUSSION:
> > >>> > > >
> https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr
> > >>> > > > ([DISCUSS] Handling spark.databricks.* config being exposed in
> > >>> 3.5.4 in
> > >>> > > > Spark 4.0.0+)
> > >>> > > >
> > >>> > > > Specifically, please review this post
> > >>> > > >
> https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k
> > >>> which
> > >>> > > > explains pros and cons about the proposal - proposal is about
> > >>> "Option 1".
> > >>> > > >
> > >>> > > > Simply speaking, this vote is to allow streaming queries which
> had
> > >>> been
> > >>> > > > ever run in Spark 3.5.4 to be upgraded with Spark 4.0.x,
> "without
> > >>> having to
> > >>> > > > be upgraded with Spark 3.5.5+ in prior". If the vote passes, we
> > >>> will help
> > >>> > > > users to have a smooth upgrade from Spark 3.5.4 to Spark 4.0.x,
> > >>> which would
> > >>> > > > be almost 1 year.
> > >>> > > >
> > >>> > > > The (only) cons in this option is having to retain the
> incorrect
> > >>> > > > configuration name as "string" in the codebase a bit longer.
> The
> > >>> code
> > >>> > > > complexity of migration logic is arguably trivial. (link
> > >>> > > > <
> > >>>
> https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183
> > >>> >
> > >>> > > > )
> > >>> > > >
> > >>> > > > This VOTE is for Spark 4.0.x, but if someone supports including
> > >>> migration
> > >>> > > > logic to be longer than Spark 4.0.x, please cast +1 here and
> leave
> > >>> the
> > >>> > > > desired last minor version of Spark to retain this migration
> logic.
> > >>> > > >
> > >>> > > > The vote is open for the next 72 hours and passes if a
> majority +1
> > >>> PMC
> > >>> > > > votes are cast, with a minimum of 3 +1 votes.
> > >>> > > >
> > >>> > > > [ ] +1 Retain migration logic of incorrect `spark.databricks.*`
> > >>> > > > configuration in Spark 4.0.x
> > >>> > > > [ ] -1 Remove migration logic of incorrect `spark.databricks.*`
> > >>> > > > configuration in Spark 4.0.0 because...
> > >>> > > >
> > >>> > > > Thanks!
> > >>> > > > Jungtaek Lim (HeartSaVioR)
> > >>> > > >
> > >>> > >
> > >>> >
> > >>> >
> ---------------------------------------------------------------------
> > >>> > To unsubscribe e-mail: [email protected]
> > >>> >
> > >>> >
> > >>>
> > >>> ---------------------------------------------------------------------
> > >>> To unsubscribe e-mail: [email protected]
> > >>>
> > >>>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
>
>

-- 
Adam Binford

Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Reply via email to