Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Yang Jie Thu, 13 Mar 2025 00:40:33 -0700

I forgot to mention in my last reply: my stance is +1.

Jie Yang


On 2025/03/13 07:08:12 Russell Jurney wrote:
> Sure, +1 non-binding.
> 
> On Wed, Mar 12, 2025 at 11:18 PM Jungtaek Lim <[email protected]>
> wrote:
> 
> > Russell,
> >
> > Of course, we also listen to people who don't have binding votes.
> > Personally, I think that is more important than committers' and PMC
> > members' votes this time, since we can be biased and far from the user
> > experience.
> >
> > Could you please explicitly cast your vote, like +1 (non-binding)? You
> > seem to agree with the proposal. Thanks!
> >
> > On Thu, Mar 13, 2025 at 3:15 PM Russell Jurney <[email protected]>
> > wrote:
> >
> >> I'm just a lurker and aspiring contributor, but as a Spark user, having
> >> to upgrade twice is very confusing and would cause many or most users to
> >> fail to upgrade successfully to Spark 4 on a first go. That seems like a
> >> very bad user experience. I thought it was worthwhile stating this out loud.
> >>
> >> Russell
> >>
> >> On Wed, Mar 12, 2025 at 11:05 PM Xiao Li <[email protected]> wrote:
> >>
> >>>> this vote is to allow streaming queries which had been ever run in Spark
> >>>> 3.5.4 to be upgraded with Spark 4.0.x, "without having to be upgraded
> >>>> with Spark 3.5.5+ in prior".
> >>>
> >>>
> >>> In the history of Apache Spark, have we ever required users to upgrade
> >>> to the next maintenance release before moving to a new feature or major
> >>> release?
> >>>
> >>> Xiao
> >>>
> >>> On Tue, Mar 11, 2025 at 09:08, Adam Binford <[email protected]> wrote:
> >>>
> >>>> +1 (non-binding)
> >>>>
> >>>> It's a pretty in-the-weeds issue with how Structured Streaming works
> >>>> under the hood that's kinda hard to understand if you're not familiar
> >>>> with it. The migration logic doesn't mean users can still use the old
> >>>> config; it's purely behind the scenes to fix checkpoint metadata in
> >>>> streams created in 3.5.4. The 5 lines of code it takes to address a
> >>>> weird edge case for certain users that's already gone from master
> >>>> shouldn't be a huge deal.
> >>>>
> >>>> On Tue, Mar 11, 2025 at 1:43 AM Yang Jie <[email protected]> wrote:
> >>>>
> >>>>>
> >>>>> To Sean, you're right, I'm very sorry.
> >>>>>
> >>>>> From the perspective of compatibility and migratability, I think we
> >>>>> should migrate this logic to 4.0.0 and keep it in the codebase for a
> >>>>> longer time (or permanently), because we can't predict which version
> >>>>> users of 3.5.4 will choose next.
> >>>>>
> >>>>>
> >>>>> I don't want to discuss the so-called vendor issue.
> >>>>>
> >>>>> I withdraw my previous -1.
> >>>>>
> >>>>> Jie Yang.
> >>>>>
> >>>>> On 2025/03/11 04:42:25 Wenchen Fan wrote:
> >>>>> > Guys, let’s be honest about what we’re discussing here.
> >>>>> >
> >>>>> > If this is a migration issue, why would we even need a vote? We’ve been
> >>>>> > consistently adding configurations to restore legacy behavior instead of
> >>>>> > removing them because we understand the challenges of upgrading Spark
> >>>>> > versions. Our goal has always been to make upgrades easier, even if it
> >>>>> > means carrying some technical debt. I don’t think we want to change that
> >>>>> > culture now.
> >>>>> >
> >>>>> > If the concern is about vendor names appearing in the codebase, then why is
> >>>>> > it a big deal this time when vendor names are already present elsewhere? If
> >>>>> > we’ve failed to follow a policy, let’s correct it, but can someone point to
> >>>>> > the specific policy we’re violating?
> >>>>> >
> >>>>> > If the vote is about adding migration logic to ease the upgrade from 3.5.4
> >>>>> > to 4.0.0, then +1, why not?
> >>>>> >
> >>>>> > Thanks,
> >>>>> > Wenchen
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim <[email protected]>
> >>>>> > wrote:
> >>>>> >
> >>>>> > > Well said, Sean. Sorry I made you stick around here; it might not
> >>>>> > > have been clearly stated. My bad.
> >>>>> > >
> >>>>> > > Yang, how could we ever tolerate the fact that there are "other"
> >>>>> > > occurrences of vendor names in the codebase? Please go and search
> >>>>> > > "databricks" in the codebase and be surprised.
> >>>>> > >
> >>>>> > > If we believe that having vendor names in the codebase increases the
> >>>>> > > chance of making mistakes, why didn't we have a discussion thread
> >>>>> > > earlier to remove all occurrences altogether? This is super tricky
> >>>>> > > because I can even start to argue we have "Apple" as a vendor name in
> >>>>> > > the Apache Spark codebase. I'm not talking about using "apple" in the
> >>>>> > > test data. See `isMacOnAppleSilicon` in Utils. Is it unavoidable? No,
> >>>>> > > `isMacOnMSeries` or `isMacOnSilicon` is enough.
> >>>>> > >
> >>>>> > > We really need to draw a line on where we disallow vendor names. If
> >>>>> > > it's the entire codebase, I don't think that is realistic.
> >>>>> > >
> >>>>> > > This was really a mistake, and it definitely did not come from
> >>>>> > > referring to the existing codebase. Not having a vendor name in the
> >>>>> > > codebase does not change the chance of encountering this issue again.
> >>>>> > > If we really care, we should think about style checking, which is the
> >>>>> > > only viable way to catch the mistake. Again, I'd argue we would have
> >>>>> > > to include a bunch of vendor names in that style check, not just the
> >>>>> > > problematic vendor name.
> >>>>> > >
> >>>>> > >
> >>>>> > > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <[email protected]> wrote:
> >>>>> > >
> >>>>> > >> Doesn't the migration code 'clear' the debt?
> >>>>> > >> The proposal is not to continue to support the config.
> >>>>> > >> I feel like people are not quite understanding the change, and are
> >>>>> > >> objecting to something that doesn't exist.
> >>>>> > >> It's a shame, as this seems like something not even worth discussing.
> >>>>> > >> I don't know why this triggered this much discussion. We have kept
> >>>>> > >> deprecated methods without blinking, which is in comparison much bigger.
> >>>>> > >> Can we maybe ask you to review the actual change in question?
> >>>>> > >>
> >>>>> > >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <[email protected]> wrote:
> >>>>> > >>
> >>>>> > >>> -1
> >>>>> > >>> Remove migration logic of incorrect `spark.databricks.*` configuration
> >>>>> > >>> in Spark 4.0.0 because I think this configuration was initially
> >>>>> > >>> introduced accidentally in Spark 3.5.4, lacking a clear design intent.
> >>>>> > >>> Although the immediate maintenance cost of retaining this configuration
> >>>>> > >>> currently seems limited, as subsequent versions iterate and user habits
> >>>>> > >>> form, it may lead to the continuous accumulation of technical debt.
> >>>>> > >>> When users come to view this configuration as one that can be relied on
> >>>>> > >>> long-term, future removal may face greater resistance from users, and
> >>>>> > >>> the configuration could become entrenched and redundant in the
> >>>>> > >>> codebase. Therefore, promptly correcting this historically accidental
> >>>>> > >>> configuration not only maintains the consistency of the Spark
> >>>>> > >>> configuration system but also prevents unintended configurations from
> >>>>> > >>> becoming de facto standards, thereby reducing long-term maintenance
> >>>>> > >>> risks.
> >>>>> > >>>
> >>>>> > >>> Jie Yang
> >>>>> > >>>
> >>>>> > >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote:
> >>>>> > >>> > -1 because there exists a feasible migration path for Apache Spark
> >>>>> > >>> > 3.5.4 via Apache Spark 3.5.5.
> >>>>> > >>> >
> >>>>> > >>> > It's obvious that this mistake by Databricks has already caused a
> >>>>> > >>> > huge communication cost in the Apache Spark community, and the
> >>>>> > >>> > proposal would burden us with handling at least two more PRs, in
> >>>>> > >>> > 4.0.0 and 4.1.0.
> >>>>> > >>> >
> >>>>> > >>> > Given that, I don't think that
> >>>>> > >>> > - this is inevitable, or
> >>>>> > >>> > - this has zero cost.
> >>>>> > >>> >
> >>>>> > >>> > Dongjoon.
> >>>>> > >>> >
> >>>>> > >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote:
> >>>>> > >>> > > Starting from my +1 (non-binding).
> >>>>> > >>> > >
> >>>>> > >>> > > In addition, I propose to retain migration logic till Spark 4.1.x
> >>>>> > >>> > > and remove it in Spark 4.2.0.
> >>>>> > >>> > >
> >>>>> > >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim <[email protected]>
> >>>>> > >>> > > wrote:
> >>>>> > >>> > >
> >>>>> > >>> > > > Hi dev,
> >>>>> > >>> > > >
> >>>>> > >>> > > > Please vote to retain migration logic of incorrect
> >>>>> > >>> > > > `spark.databricks.*` configuration in Spark 4.0.x.
> >>>>> > >>> > > >
> >>>>> > >>> > > > - DISCUSSION:
> >>>>> > >>> > > > https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr
> >>>>> > >>> > > > ([DISCUSS] Handling spark.databricks.* config being exposed in
> >>>>> > >>> > > > 3.5.4 in Spark 4.0.0+)
> >>>>> > >>> > > >
> >>>>> > >>> > > > Specifically, please review this post
> >>>>> > >>> > > > https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k
> >>>>> > >>> > > > which explains the pros and cons of the proposal. The proposal is
> >>>>> > >>> > > > "Option 1".
> >>>>> > >>> > > >
> >>>>> > >>> > > > Simply speaking, this vote is to allow streaming queries which had
> >>>>> > >>> > > > been ever run in Spark 3.5.4 to be upgraded with Spark 4.0.x,
> >>>>> > >>> > > > "without having to be upgraded with Spark 3.5.5+ in prior". If the
> >>>>> > >>> > > > vote passes, we will help users to have a smooth upgrade from Spark
> >>>>> > >>> > > > 3.5.4 to Spark 4.0.x, which would be almost 1 year.
> >>>>> > >>> > > >
> >>>>> > >>> > > > The (only) con of this option is having to retain the incorrect
> >>>>> > >>> > > > configuration name as a "string" in the codebase a bit longer. The
> >>>>> > >>> > > > code complexity of the migration logic is arguably trivial. (link
> >>>>> > >>> > > > <https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183>)
> >>>>> > >>> > > >
> >>>>> > >>> > > > This VOTE is for Spark 4.0.x, but if you support retaining the
> >>>>> > >>> > > > migration logic beyond Spark 4.0.x, please cast +1 here and state
> >>>>> > >>> > > > the last minor version of Spark that should retain this migration
> >>>>> > >>> > > > logic.
> >>>>> > >>> > > >
> >>>>> > >>> > > > The vote is open for the next 72 hours and passes if a majority of
> >>>>> > >>> > > > +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >>>>> > >>> > > >
> >>>>> > >>> > > > [ ] +1 Retain migration logic of incorrect `spark.databricks.*`
> >>>>> > >>> > > > configuration in Spark 4.0.x
> >>>>> > >>> > > > [ ] -1 Remove migration logic of incorrect `spark.databricks.*`
> >>>>> > >>> > > > configuration in Spark 4.0.0 because...
> >>>>> > >>> > > >
> >>>>> > >>> > > > Thanks!
> >>>>> > >>> > > > Jungtaek Lim (HeartSaVioR)
> >>>>> > >>> > > >
> >>>>> > >>> > >
> >>>>> > >>> >
> >>>>> > >>> >
> >>>>> ---------------------------------------------------------------------
> >>>>> > >>> > To unsubscribe e-mail: [email protected]
> >>>>> > >>> >
> >>>>> > >>> >
> >>>>> > >>>
> >>>>> >
> >>>>>
> >>>>
> >>>> --
> >>>> Adam Binford
> >>>>
> >>>
> 
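
For anyone reading the thread without the linked OffsetSeq.scala open, the kind of migration being voted on can be sketched roughly as below. This is only an illustration in Python, not the actual Spark code: the two key names are hypothetical placeholders (the thread only names the `spark.databricks.*` prefix), and preferring a value already stored under the correct key is my assumption. The point is that the old key is rewritten when checkpoint metadata is read, so users never set or see the old name.

```python
from typing import Dict

# Hypothetical key names, for illustration only; the thread does not spell
# out the real configuration keys.
OLD_KEY = "spark.databricks.example.someFlag"
NEW_KEY = "spark.example.someFlag"


def migrate_checkpoint_conf(conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the stored conf with the incorrect key renamed.

    The input mapping is left untouched; only the returned copy changes.
    """
    migrated = dict(conf)
    if OLD_KEY in migrated:
        old_value = migrated.pop(OLD_KEY)
        # Assumption: if the correct key is already present, keep its value
        # and simply drop the old key.
        migrated.setdefault(NEW_KEY, old_value)
    return migrated


if __name__ == "__main__":
    # Conf as it might have been written to a checkpoint by the affected release.
    stored = {OLD_KEY: "false", "spark.sql.shuffle.partitions": "200"}
    print(migrate_checkpoint_conf(stored))
```

A conf that never contained the old key passes through unchanged, which is why keeping logic of this shape around costs little for everyone who did not run 3.5.4.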

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
