Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Yang Jie Mon, 10 Mar 2025 22:41:35 -0700


To Sean, you're right, I'm very sorry.


>From the perspective of compatibility and migratability, I think we should 
>migrate this logic to 4.0.0 and keep it in the codebase for a longer time (or 
>permanently), because we can't predict which version users of 3.5.4 will 
>choose next. 


I don't want to discuss the so-called vendor issue. 

I withdraw my previous -1. 

Jie Yang.

On 2025/03/11 04:42:25 Wenchen Fan wrote:
> Guys, let’s be honest about what we’re discussing here.
> 
> If this is a migration issue, why would we even need a vote? We’ve been
> consistently adding configurations to restore legacy behavior instead of
> removing them because we understand the challenges of upgrading Spark
> versions. Our goal has always been to make upgrades easier, even if it
> means carrying some technical debt. I don’t think we want to change that
> culture now.
> 
> If the concern is about vendor names appearing in the codebase, then why is
> it a big deal this time when vendor names are already present elsewhere? If
> we’ve failed to follow a policy, let’s correct it, but can someone point to
> the specific policy we’re violating?
> 
> If the vote is about adding migration logic to ease the upgrade from 3.5.4
> to 4.0.0, then +1, why not?
> 
> Thanks,
> Wenchen
> 
> 
> 
> On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim <[email protected]>
> wrote:
> 
> > Well said, Sean. Sorry I made you keep around here since it might not be
> > clearly stated. My bad.
> >
> > Yang, how could we ever tolerate the fact there are "other" occurrences of
> > vendor names in the codebase? Please go and search "databricks" in the
> > codebase and be surprised.
> >
> > If we believe that having vendor names in the codebase will increase
> > the occurrence of making mistakes, why didn't we have a discussion thread
> > earlier to remove all occurrences altogether? This is super tricky because
> > I can even start to argue we have "Apple" as a vendor name in Apache Spark
> > codebase. I'm not saying we use "apple" in the test data. See
> > `isMacOnAppleSilicon` in Utils. Is it unavoidable? No, `isMacOnMSeries` or
> > `isMacOnSilicon` is enough.
> >
> > We really need to draw a line where we disallow vendor names on it - if
> > it's the entire codebase, I don't really think it is realistic.
> >
> > This was really a mistake, and it was definitely not from referring to the
> > existing codebase. Not having a vendor name does not change anything on the
> > chance of encountering this issue again. If we really care, we should think
> > about style checking, which is the only viable way to catch the mistake.
> > Again, I'd argue we have to have a bunch of vendor names in that style
> > check, not just the problematic vendor name.
> >
> >
> > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <[email protected]> wrote:
> >
> >> Doesn't the migration code 'clear' the debt?
> >> The proposal is not to continue to support the config.
> >> I feel like people are not quite understanding the change, and objecting
> >> to something that doesn't exist.
> >> It's a shame, as this seems like something not even worth discussing. I
> >> don't know why this triggered this much discussion. We have kept deprecated
> >> methods without blinking, which is in comparison much bigger.
> >> Can we maybe ask you review the actual change in question?
> >>
> >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <[email protected]> wrote:
> >>
> >>> -1
> >>> Remove migration logic of incorrect `spark.databricks.*` configuration
> >>> in Spark 4.0.0 because I think this configuration was initially introduced
> >>> accidentally in Spark 3.5.4, lacking a clear design intent. Although the
> >>> immediate maintenance cost of retaining this configuration currently seems
> >>> limited, as subsequent versions iterate and user habits form, it may lead
> >>> to the continuous accumulation of technical debt. When users come to view
> >>> this configuration as one that can be relied on long-term, future removal
> >>> may face greater resistance from users and could potentially become an
> >>> entrenched and redundant configuration in the codebase. Therefore, 
> >>> promptly
> >>> correcting this historically accidental configuration not only maintains
> >>> the normativity of the Spark configuration system but also prevents
> >>> unintended configurations from becoming de facto standards, thereby
> >>> reducing long-term maintenance risks.
> >>>
> >>> Jie Yang
> >>>
> >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote:
> >>> > -1 because there exists a feasible migration path for Apache Spark
> >>> 3.5.4 via Apache Spark 3.5.5.
> >>> >
> >>> > It's obvious that this Databricks' mistake already causes a huge
> >>> communication cost in the Apache Spark community and is suggesting a 
> >>> burden
> >>> to enforce us to handle at least two more PRs at 4.0.0 and 4.1.0.
> >>> >
> >>> > Given that, I don't think
> >>> > - This is an inevitable or
> >>> > - This is 0 cost
> >>> >
> >>> > Dongjoon.
> >>> >
> >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote:
> >>> > > Starting from my +1 (non-binding).
> >>> > >
> >>> > > In addition, I propose to retain migration logic till Spark 4.1.x and
> >>> > > remove it in Spark 4.2.0.
> >>> > >
> >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim <
> >>> [email protected]>
> >>> > > wrote:
> >>> > >
> >>> > > > Hi dev,
> >>> > > >
> >>> > > > Please vote to retain migration logic of incorrect
> >>> `spark.databricks.*`
> >>> > > > configuration in Spark 4.0.x.
> >>> > > >
> >>> > > > - DISCUSSION:
> >>> > > > https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr
> >>> > > > ([DISCUSS] Handling spark.databricks.* config being exposed in
> >>> 3.5.4 in
> >>> > > > Spark 4.0.0+)
> >>> > > >
> >>> > > > Specifically, please review this post
> >>> > > > https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k
> >>> which
> >>> > > > explains pros and cons about the proposal - proposal is about
> >>> "Option 1".
> >>> > > >
> >>> > > > Simply speaking, this vote is to allow streaming queries which had
> >>> been
> >>> > > > ever run in Spark 3.5.4 to be upgraded with Spark 4.0.x, "without
> >>> having to
> >>> > > > be upgraded with Spark 3.5.5+ in prior". If the vote passes, we
> >>> will help
> >>> > > > users to have a smooth upgrade from Spark 3.5.4 to Spark 4.0.x,
> >>> which would
> >>> > > > be almost 1 year.
> >>> > > >
> >>> > > > The (only) cons in this option is having to retain the incorrect
> >>> > > > configuration name as "string" in the codebase a bit longer. The
> >>> code
> >>> > > > complexity of migration logic is arguably trivial. (link
> >>> > > > <
> >>> https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183
> >>> >
> >>> > > > )
> >>> > > >
> >>> > > > This VOTE is for Spark 4.0.x, but if someone supports including
> >>> migration
> >>> > > > logic to be longer than Spark 4.0.x, please cast +1 here and leave
> >>> the
> >>> > > > desired last minor version of Spark to retain this migration logic.
> >>> > > >
> >>> > > > The vote is open for the next 72 hours and passes if a majority +1
> >>> PMC
> >>> > > > votes are cast, with a minimum of 3 +1 votes.
> >>> > > >
> >>> > > > [ ] +1 Retain migration logic of incorrect `spark.databricks.*`
> >>> > > > configuration in Spark 4.0.x
> >>> > > > [ ] -1 Remove migration logic of incorrect `spark.databricks.*`
> >>> > > > configuration in Spark 4.0.0 because...
> >>> > > >
> >>> > > > Thanks!
> >>> > > > Jungtaek Lim (HeartSaVioR)
> >>> > > >
> >>> > >
> >>> >
> >>> > ---------------------------------------------------------------------
> >>> > To unsubscribe e-mail: [email protected]
> >>> >
> >>> >
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe e-mail: [email protected]
> >>>
> >>>
> 

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]

Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Reply via email to