Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

Jungtaek Lim Thu, 13 Mar 2025 14:32:45 -0700

Also, I don't believe treating the -1 as a veto makes sense here, because his
proposal is "somehow" (I'd rather say "accidentally") in the current
codebase and we never had any discussion about that proposal. So if we kill
the VOTE and do nothing, it's effectively saying +1 to his proposal, which
makes zero sense to me.


On Fri, Mar 14, 2025 at 5:57 AM Jungtaek Lim <[email protected]>
wrote:

> I do believe there are two ways of considering a -1 vote. Valid -1 votes are
> not restricted to technical objections, but in that case a -1 must not be
> considered a veto; otherwise we will end up blocking ourselves. Only in an
> ideal world could we reach consensus on every topic; in practice, we
> can't.
>
> Please give me the evidence if you think -1 should be considered as veto,
> otherwise I'll conclude the vote sooner.
>
>
>
> On Fri, Mar 14, 2025 at 12:22 AM Mark Hamstra <[email protected]>
> wrote:
>
>> Valid -1 votes are not restricted to technical objections.
>>
>> On Thu, Mar 13, 2025 at 7:28 AM Sean Owen <[email protected]> wrote:
>> >
>> > I'm not sure if a VOTE is appropriate here, but I also do not see any
>> valid technical objection here. I don't think this can be considered a
>> valid 'veto' even if we were thinking of it that way.
>> > I think there are other non-technical factors influencing this
>> position. I believe we proceed with Jungtaek's proposal.
>> >
>> > On Thu, Mar 13, 2025 at 9:17 AM Dongjoon Hyun <[email protected]>
>> wrote:
>> >>
>> >> We are having this vote to give clarity by keeping all records of the
>> community decisions and stances during building a community consensus. All
>> votes are important and counted.
>> >>
>> >> To Jungtaek, I already cast my veto properly and have been tracking
>> the thread. You don't need to tell me to revisit because I've been here.
>> >>
>> >> To Xiao, in the history of Apache Spark, have we ever made a mistake
>> to ship a vendor-ownership like `spark.databricks.*`? I believe you are
>> switching the real root cause and the bad consequence here.
>> >> > In the history of Apache Spark, have we ever required users to
>> upgrade to the next maintenance release before moving to a new feature or
>> major release?
>> >>
>> >> Thanks,
>> >> Dongjoon.
>> >>
>> >>
>> >> On Thu, Mar 13, 2025 at 12:58 AM Jungtaek Lim <
>> [email protected]> wrote:
>> >>>
>> >>> Thanks to everyone who participated and voted!
>> >>>
>> >>> Now I can technically conclude the VOTE, but I'm willing to wait till
>> US daytime tomorrow, to give some time for Dongjoon to revisit this.
>> >>>
>> >>> I'll conclude the vote around 6PM PST tomorrow regardless of his
>> vote. It's ideal to see us have no -1, but having one -1 doesn't block this
>> vote and we can move forward.
>> >>>
>> >>> On Thu, Mar 13, 2025 at 4:42 PM Yang Jie <[email protected]>
>> wrote:
>> >>>>
>> >>>> forgot to mention in my last reply, my stance is +1
>> >>>>
>> >>>> Jie Yang
>> >>>>
>> >>>> On 2025/03/13 07:08:12 Russell Jurney wrote:
>> >>>> > Sure, +1 non-binding.
>> >>>> >
>> >>>> > On Wed, Mar 12, 2025 at 11:18 PM Jungtaek Lim <
>> [email protected]>
>> >>>> > wrote:
>> >>>> >
>> >>>> > > Russell,
>> >>>> > >
>> >>>> > > Of course, we also hear the voices of people who don't have binding
>> >>>> > > votes. Personally, I think they're more important than committers/PMC
>> >>>> > > members' votes this time, since we can be biased and far from the user
>> >>>> > > experience.
>> >>>> > >
>> >>>> > > Could you please explicitly cast your vote, like +1
>> (non-binding)? You
>> >>>> > > seem to agree with the proposal. Thanks!
>> >>>> > >
>> >>>> > > On Thu, Mar 13, 2025 at 3:15 PM Russell Jurney <
>> [email protected]>
>> >>>> > > wrote:
>> >>>> > >
>> >>>> > >> I'm just a lurker and aspiring contributor, but as a Spark user
>> upgrading
>> >>>> > >> twice is very confusing and would cause many or most users to
>> fail to
>> >>>> > >> upgrade successfully to Spark 4 on a first go. That seems like
>> a very bad
>> >>>> > >> user experience. I thought it was worthwhile stating this out
>> loud.
>> >>>> > >>
>> >>>> > >> Russell
>> >>>> > >>
>> >>>> > >> On Wed, Mar 12, 2025 at 11:05 PM Xiao Li <[email protected]>
>> wrote:
>> >>>> > >>
>> >>>> > >>> this vote is to allow streaming queries which had been ever
>> run in Spark
>> >>>> > >>>> 3.5.4 to be upgraded with Spark 4.0.x, "without having to be
>> upgraded with
>> >>>> > >>>> Spark 3.5.5+ in prior".
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> In the history of Apache Spark, have we ever required users to
>> upgrade
>> >>>> > >>> to the next maintenance release before moving to a new feature
>> or major
>> >>>> > >>> release?
>> >>>> > >>>
>> >>>> > >>> Xiao
>> >>>> > >>>
>> >>>> > >>> On Tue, Mar 11, 2025 at 09:08, Adam Binford <[email protected]> wrote:
>> >>>> > >>>
>> >>>> > >>>> +1 (non-binding)
>> >>>> > >>>>
>> >>>> > >>>> It's a pretty in the weeds issue with how Structured
>> Streaming works
>> >>>> > >>>> under the hood that's kinda hard to understand if you're not
>> familiar with
>> >>>> > >>>> it. The migration logic doesn't mean users can still use the
>> old config,
>> >>>> > >>>> it's purely behind the scenes to fix checkpoint metadata in
>> streams created
>> >>>> > >>>> in 3.5.4. The 5 lines of code it takes to address a weird
>> edge case for
>> >>>> > >>>> certain users that's already gone from master shouldn't be a
>> huge deal.
>> >>>> > >>>>
>> >>>> > >>>> On Tue, Mar 11, 2025 at 1:43 AM Yang Jie <
>> [email protected]> wrote:
>> >>>> > >>>>
>> >>>> > >>>>>
>> >>>> > >>>>> To Sean, you're right, I'm very sorry.
>> >>>> > >>>>>
>> >>>> > >>>>> From the perspective of compatibility and migratability, I
>> think we
>> >>>> > >>>>> should migrate this logic to 4.0.0 and keep it in the
>> codebase for a longer
>> >>>> > >>>>> time (or permanently), because we can't predict which
>> version users of
>> >>>> > >>>>> 3.5.4 will choose next.
>> >>>> > >>>>>
>> >>>> > >>>>>
>> >>>> > >>>>> I don't want to discuss the so-called vendor issue.
>> >>>> > >>>>>
>> >>>> > >>>>> I withdraw my previous -1.
>> >>>> > >>>>>
>> >>>> > >>>>> Jie Yang.
>> >>>> > >>>>>
>> >>>> > >>>>> On 2025/03/11 04:42:25 Wenchen Fan wrote:
>> >>>> > >>>>> > Guys, let’s be honest about what we’re discussing here.
>> >>>> > >>>>> >
>> >>>> > >>>>> > If this is a migration issue, why would we even need a
>> vote? We’ve
>> >>>> > >>>>> been
>> >>>> > >>>>> > consistently adding configurations to restore legacy
>> behavior
>> >>>> > >>>>> instead of
>> >>>> > >>>>> > removing them because we understand the challenges of
>> upgrading Spark
>> >>>> > >>>>> > versions. Our goal has always been to make upgrades
>> easier, even if
>> >>>> > >>>>> it
>> >>>> > >>>>> > means carrying some technical debt. I don’t think we want
>> to change
>> >>>> > >>>>> that
>> >>>> > >>>>> > culture now.
>> >>>> > >>>>> >
>> >>>> > >>>>> > If the concern is about vendor names appearing in the
>> codebase, then
>> >>>> > >>>>> why is
>> >>>> > >>>>> > it a big deal this time when vendor names are already
>> present
>> >>>> > >>>>> elsewhere? If
>> >>>> > >>>>> > we’ve failed to follow a policy, let’s correct it, but can
>> someone
>> >>>> > >>>>> point to
>> >>>> > >>>>> > the specific policy we’re violating?
>> >>>> > >>>>> >
>> >>>> > >>>>> > If the vote is about adding migration logic to ease the
>> upgrade from
>> >>>> > >>>>> 3.5.4
>> >>>> > >>>>> > to 4.0.0, then +1, why not?
>> >>>> > >>>>> >
>> >>>> > >>>>> > Thanks,
>> >>>> > >>>>> > Wenchen
>> >>>> > >>>>> >
>> >>>> > >>>>> >
>> >>>> > >>>>> >
>> >>>> > >>>>> > On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim <
>> >>>> > >>>>> [email protected]>
>> >>>> > >>>>> > wrote:
>> >>>> > >>>>> >
>> >>>> > >>>>> > > Well said, Sean. Sorry I made you keep around here since
>> it might
>> >>>> > >>>>> not be
>> >>>> > >>>>> > > clearly stated. My bad.
>> >>>> > >>>>> > >
>> >>>> > >>>>> > > Yang, how could we ever tolerate the fact there are
>> "other"
>> >>>> > >>>>> occurrences of
>> >>>> > >>>>> > > vendor names in the codebase? Please go and search
>> "databricks" in
>> >>>> > >>>>> the
>> >>>> > >>>>> > > codebase and be surprised.
>> >>>> > >>>>> > >
>> >>>> > >>>>> > > If we believe that having vendor names in the codebase
>> will
>> >>>> > >>>>> increase
>> >>>> > >>>>> > > the occurrence of making mistakes, why didn't we have a
>> discussion
>> >>>> > >>>>> thread
>> >>>> > >>>>> > > earlier to remove all occurrences altogether? This is
>> super tricky
>> >>>> > >>>>> because
>> >>>> > >>>>> > > I can even start to argue we have "Apple" as a vendor
>> name in
>> >>>> > >>>>> Apache Spark
>> >>>> > >>>>> > > codebase. I'm not saying we use "apple" in the test
>> data. See
>> >>>> > >>>>> > > `isMacOnAppleSilicon` in Utils. Is it unavoidable? No,
>> >>>> > >>>>> `isMacOnMSeries` or
>> >>>> > >>>>> > > `isMacOnSilicon` is enough.
>> >>>> > >>>>> > >
>> >>>> > >>>>> > > We really need to draw a line where we disallow vendor
>> names on it
>> >>>> > >>>>> - if
>> >>>> > >>>>> > > it's the entire codebase, I don't really think it is
>> realistic.
>> >>>> > >>>>> > >
>> >>>> > >>>>> > > This was really a mistake, and it was definitely not from
>> >>>> > >>>>> referring to the
>> >>>> > >>>>> > > existing codebase. Not having a vendor name does not
>> change
>> >>>> > >>>>> anything on the
>> >>>> > >>>>> > > chance of encountering this issue again. If we really
>> care, we
>> >>>> > >>>>> should think
>> >>>> > >>>>> > > about style checking, which is the only viable way to
>> catch the
>> >>>> > >>>>> mistake.
>> >>>> > >>>>> > > Again, I'd argue we have to have a bunch of vendor names
>> in that
>> >>>> > >>>>> style
>> >>>> > >>>>> > > check, not just the problematic vendor name.
>> >>>> > >>>>> > >
>> >>>> > >>>>> > >
>> >>>> > >>>>> > > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <
>> [email protected]>
>> >>>> > >>>>> wrote:
>> >>>> > >>>>> > >
>> >>>> > >>>>> > >> Doesn't the migration code 'clear' the debt?
>> >>>> > >>>>> > >> The proposal is not to continue to support the config.
>> >>>> > >>>>> > >> I feel like people are not quite understanding the
>> change, and
>> >>>> > >>>>> objecting
>> >>>> > >>>>> > >> to something that doesn't exist.
>> >>>> > >>>>> > >> It's a shame, as this seems like something not even
>> worth
>> >>>> > >>>>> discussing. I
>> >>>> > >>>>> > >> don't know why this triggered this much discussion. We
>> have kept
>> >>>> > >>>>> deprecated
>> >>>> > >>>>> > >> methods without blinking, which is in comparison much
>> bigger.
>> >>>> > >>>>> > >> Can we maybe ask you to review the actual change in question?
>> >>>> > >>>>> > >>
>> >>>> > >>>>> > >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <
>> [email protected]>
>> >>>> > >>>>> wrote:
>> >>>> > >>>>> > >>
>> >>>> > >>>>> > >>> -1
>> >>>> > >>>>> > >>> Remove migration logic of incorrect
>> `spark.databricks.*`
>> >>>> > >>>>> configuration
>> >>>> > >>>>> > >>> in Spark 4.0.0 because I think this configuration was
>> initially
>> >>>> > >>>>> introduced
>> >>>> > >>>>> > >>> accidentally in Spark 3.5.4, lacking a clear design
>> intent.
>> >>>> > >>>>> Although the
>> >>>> > >>>>> > >>> immediate maintenance cost of retaining this
>> configuration
>> >>>> > >>>>> currently seems
>> >>>> > >>>>> > >>> limited, as subsequent versions iterate and user
>> habits form, it
>> >>>> > >>>>> may lead
>> >>>> > >>>>> > >>> to the continuous accumulation of technical debt. When
>> users
>> >>>> > >>>>> come to view
>> >>>> > >>>>> > >>> this configuration as one that can be relied on
>> long-term,
>> >>>> > >>>>> future removal
>> >>>> > >>>>> > >>> may face greater resistance from users and could
>> potentially
>> >>>> > >>>>> become an
>> >>>> > >>>>> > >>> entrenched and redundant configuration in the codebase.
>> >>>> > >>>>> Therefore, promptly
>> >>>> > >>>>> > >>> correcting this historically accidental configuration
>> not only
>> >>>> > >>>>> maintains
>> >>>> > >>>>> > >>> the normativity of the Spark configuration system but
>> also
>> >>>> > >>>>> prevents
>> >>>> > >>>>> > >>> unintended configurations from becoming de facto
>> standards,
>> >>>> > >>>>> thereby
>> >>>> > >>>>> > >>> reducing long-term maintenance risks.
>> >>>> > >>>>> > >>>
>> >>>> > >>>>> > >>> Jie Yang
>> >>>> > >>>>> > >>>
>> >>>> > >>>>> > >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote:
>> >>>> > >>>>> > >>> > -1 because there exists a feasible migration path
>> for Apache
>> >>>> > >>>>> Spark
>> >>>> > >>>>> > >>> 3.5.4 via Apache Spark 3.5.5.
>> >>>> > >>>>> > >>> >
>> >>>> > >>>>> > >>> > It's obvious that this Databricks' mistake already
>> causes a
>> >>>> > >>>>> huge
>> >>>> > >>>>> > >>> communication cost in the Apache Spark community and is
>> >>>> > >>>>> suggesting a burden
>> >>>> > >>>>> > >>> to enforce us to handle at least two more PRs at 4.0.0
>> and 4.1.0.
>> >>>> > >>>>> > >>> >
>> >>>> > >>>>> > >>> > Given that, I don't think
>> >>>> > >>>>> > >>> > - This is inevitable or
>> >>>> > >>>>> > >>> > - This is 0 cost
>> >>>> > >>>>> > >>> >
>> >>>> > >>>>> > >>> > Dongjoon.
>> >>>> > >>>>> > >>> >
>> >>>> > >>>>> > >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote:
>> >>>> > >>>>> > >>> > > Starting from my +1 (non-binding).
>> >>>> > >>>>> > >>> > >
>> >>>> > >>>>> > >>> > > In addition, I propose to retain migration logic
>> till Spark
>> >>>> > >>>>> 4.1.x and
>> >>>> > >>>>> > >>> > > remove it in Spark 4.2.0.
>> >>>> > >>>>> > >>> > >
>> >>>> > >>>>> > >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim <
>> >>>> > >>>>> > >>> [email protected]>
>> >>>> > >>>>> > >>> > > wrote:
>> >>>> > >>>>> > >>> > >
>> >>>> > >>>>> > >>> > > > Hi dev,
>> >>>> > >>>>> > >>> > > >
>> >>>> > >>>>> > >>> > > > Please vote to retain migration logic of
>> incorrect
>> >>>> > >>>>> > >>> `spark.databricks.*`
>> >>>> > >>>>> > >>> > > > configuration in Spark 4.0.x.
>> >>>> > >>>>> > >>> > > >
>> >>>> > >>>>> > >>> > > > - DISCUSSION:
>> >>>> > >>>>> > >>> > > > https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr
>> >>>> > >>>>> > >>> > > > ([DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in
>> >>>> > >>>>> > >>> > > > Spark 4.0.0+)
>> >>>> > >>>>> > >>> > > >
>> >>>> > >>>>> > >>> > > > Specifically, please review this post
>> >>>> > >>>>> > >>> > > > https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k which
>> >>>> > >>>>> > >>> > > > explains pros and cons about the proposal - proposal is about "Option 1".
>> >>>> > >>>>> > >>> > > >
>> >>>> > >>>>> > >>> > > > Simply speaking, this vote is to allow streaming
>> queries
>> >>>> > >>>>> which had
>> >>>> > >>>>> > >>> been
>> >>>> > >>>>> > >>> > > > ever run in Spark 3.5.4 to be upgraded with
>> Spark 4.0.x,
>> >>>> > >>>>> "without
>> >>>> > >>>>> > >>> having to
>> >>>> > >>>>> > >>> > > > be upgraded with Spark 3.5.5+ in prior". If the
>> vote
>> >>>> > >>>>> passes, we
>> >>>> > >>>>> > >>> will help
>> >>>> > >>>>> > >>> > > > users to have a smooth upgrade from Spark 3.5.4
>> to Spark
>> >>>> > >>>>> 4.0.x,
>> >>>> > >>>>> > >>> which would
>> >>>> > >>>>> > >>> > > > be almost 1 year.
>> >>>> > >>>>> > >>> > > >
>> >>>> > >>>>> > >>> > > > The (only) con of this option is having to
>> retain the
>> >>>> > >>>>> incorrect
>> >>>> > >>>>> > >>> > > > configuration name as "string" in the codebase a
>> bit
>> >>>> > >>>>> longer. The
>> >>>> > >>>>> > >>> code
>> >>>> > >>>>> > >>> > > > complexity of migration logic is arguably trivial. (link
>> >>>> > >>>>> > >>> > > > <https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183>)
>> >>>> > >>>>> > >>> > > >
>> >>>> > >>>>> > >>> > > > This VOTE is for Spark 4.0.x, but if someone
>> supports
>> >>>> > >>>>> including
>> >>>> > >>>>> > >>> migration
>> >>>> > >>>>> > >>> > > > logic to be longer than Spark 4.0.x, please cast
>> +1 here
>> >>>> > >>>>> and leave
>> >>>> > >>>>> > >>> the
>> >>>> > >>>>> > >>> > > > desired last minor version of Spark to retain
>> this
>> >>>> > >>>>> migration logic.
>> >>>> > >>>>> > >>> > > >
>> >>>> > >>>>> > >>> > > > The vote is open for the next 72 hours and
>> passes if a
>> >>>> > >>>>> majority +1
>> >>>> > >>>>> > >>> PMC
>> >>>> > >>>>> > >>> > > > votes are cast, with a minimum of 3 +1 votes.
>> >>>> > >>>>> > >>> > > >
>> >>>> > >>>>> > >>> > > > [ ] +1 Retain migration logic of incorrect
>> >>>> > >>>>> `spark.databricks.*`
>> >>>> > >>>>> > >>> > > > configuration in Spark 4.0.x
>> >>>> > >>>>> > >>> > > > [ ] -1 Remove migration logic of incorrect
>> >>>> > >>>>> `spark.databricks.*`
>> >>>> > >>>>> > >>> > > > configuration in Spark 4.0.0 because...
>> >>>> > >>>>> > >>> > > >
>> >>>> > >>>>> > >>> > > > Thanks!
>> >>>> > >>>>> > >>> > > > Jungtaek Lim (HeartSaVioR)
>> >>>> > >>>>> > >>> > > >
>> >>>> > >>>>> > >>> > >
>> >>>> > >>>>> > >>> >
>> >>>> > >>>>> > >>> >
>> >>>> > >>>>>
>> ---------------------------------------------------------------------
>> >>>> > >>>>> > >>> > To unsubscribe e-mail:
>> [email protected]
>> >>>> > >>>>> > >>> >
>> >>>> > >>>>> > >>> >
>> >>>> > >>>>> > >>>
>> >>>> > >>>>> > >>>
>> >>>> > >>>>> > >>>
>> >>>> > >>>>> > >>>
>> >>>> > >>>>> >
>> >>>> > >>>>>
>> >>>> > >>>>>
>> >>>> > >>>>>
>> >>>> > >>>>>
>> >>>> > >>>>
>> >>>> > >>>> --
>> >>>> > >>>> Adam Binford
>> >>>> > >>>>
>> >>>> > >>>
>> >>>> >
>> >>>>
>> >>>>
>>
>>
>>
