I'm not sure a VOTE is appropriate here, but I also do not see any valid technical objection. I don't think this can be considered a valid 'veto', even if we were thinking of it that way. I think there are other, non-technical factors influencing this position. I believe we should proceed with Jungtaek's proposal.

On Thu, Mar 13, 2025 at 9:17 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> We are having this vote to give clarity by keeping all records of the community decisions and stances during building a community consensus. All votes are important and counted.
>
> To Jungtaek, I already casted my veto properly and have been tracking the thread. You don't need to say to me to revisit because I've been here.
>
> To Xiao, in the history of Apache Spark, have we ever made a mistake to ship a vendor-ownership like `spark.databricks.*`? I believe you are switching the real root cause and the bad consequence here.
>
> In the history of Apache Spark, have we ever required users to upgrade to the next maintenance release before moving to a new feature or major release?
>
> Thanks,
> Dongjoon.
>
> On Thu, Mar 13, 2025 at 12:58 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> Thanks to everyone who participated and voted!
>>
>> Now I can technically conclude the VOTE, but I'm willing to wait till US daytime tomorrow, to give some time for Dongjoon to revisit this.
>>
>> I'll conclude the vote around 6PM PST tomorrow regardless of his vote. It's ideal to see us have no -1, but having one -1 doesn't block this vote and we can move forward.
>>
>> On Thu, Mar 13, 2025 at 4:42 PM Yang Jie <yangji...@apache.org> wrote:
>>
>>> forgot to mention in my last reply, my stance is +1
>>>
>>> Jie Yang
>>>
>>> On 2025/03/13 07:08:12 Russell Jurney wrote:
>>> > Sure, +1 non-binding.
>>> >
>>> > On Wed, Mar 12, 2025 at 11:18 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>> >
>>> > > Russell,
>>> > >
>>> > > Of course, we hear people' voices who aren't having binding votes as well. Personally I think it's more important than committers/PMC members' VOTE this time since we can be biased and be far from user experience.
>>> > >
>>> > > Could you please explicitly cast your vote, like +1 (non-binding)?
>>> > > You seem to agree with the proposal. Thanks!
>>> > >
>>> > > On Thu, Mar 13, 2025 at 3:15 PM Russell Jurney <russell.jur...@gmail.com> wrote:
>>> > >
>>> > >> I'm just a lurker and aspiring contributor, but as a Spark user upgrading twice is very confusing and would cause many or most users to fail to upgrade successfully to Spark 4 on a first go. That seems like a very bad user experience. I thought it was worthwhile stating this out loud.
>>> > >>
>>> > >> Russell
>>> > >>
>>> > >> On Wed, Mar 12, 2025 at 11:05 PM Xiao Li <gatorsm...@gmail.com> wrote:
>>> > >>
>>> > >>>> this vote is to allow streaming queries which had been ever run in Spark 3.5.4 to be upgraded with Spark 4.0.x, "without having to be upgraded with Spark 3.5.5+ in prior".
>>> > >>>
>>> > >>> In the history of Apache Spark, have we ever required users to upgrade to the next maintenance release before moving to a new feature or major release?
>>> > >>>
>>> > >>> Xiao
>>> > >>>
>>> > >>> On Tue, Mar 11, 2025 at 9:08 AM Adam Binford <adam...@gmail.com> wrote:
>>> > >>>
>>> > >>>> +1 (non-binding)
>>> > >>>>
>>> > >>>> It's a pretty in the weeds issue with how Structured Streaming works under the hood that's kinda hard to understand if you're not familiar with it. The migration logic doesn't mean users can still use the old config, it's purely behind the scenes to fix checkpoint metadata in streams created in 3.5.4. The 5 lines of code it takes to address a weird edge case for certain users that's already gone from master shouldn't be a huge deal.
>>> > >>>>
>>> > >>>> On Tue, Mar 11, 2025 at 1:43 AM Yang Jie <yangji...@apache.org> wrote:
>>> > >>>>
>>> > >>>>> To Sean, you're right, I'm very sorry.
>>> > >>>>>
>>> > >>>>> From the perspective of compatibility and migratability, I think we should migrate this logic to 4.0.0 and keep it in the codebase for a longer time (or permanently), because we can't predict which version users of 3.5.4 will choose next.
>>> > >>>>>
>>> > >>>>> I don't want to discuss the so-called vendor issue.
>>> > >>>>>
>>> > >>>>> I withdraw my previous -1.
>>> > >>>>>
>>> > >>>>> Jie Yang.
>>> > >>>>>
>>> > >>>>> On 2025/03/11 04:42:25 Wenchen Fan wrote:
>>> > >>>>> > Guys, let’s be honest about what we’re discussing here.
>>> > >>>>> >
>>> > >>>>> > If this is a migration issue, why would we even need a vote? We’ve been consistently adding configurations to restore legacy behavior instead of removing them because we understand the challenges of upgrading Spark versions. Our goal has always been to make upgrades easier, even if it means carrying some technical debt. I don’t think we want to change that culture now.
>>> > >>>>> >
>>> > >>>>> > If the concern is about vendor names appearing in the codebase, then why is it a big deal this time when vendor names are already present elsewhere? If we’ve failed to follow a policy, let’s correct it, but can someone point to the specific policy we’re violating?
>>> > >>>>> >
>>> > >>>>> > If the vote is about adding migration logic to ease the upgrade from 3.5.4 to 4.0.0, then +1, why not?
>>> > >>>>> >
>>> > >>>>> > Thanks,
>>> > >>>>> > Wenchen
>>> > >>>>> >
>>> > >>>>> > On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>> > >>>>> >
>>> > >>>>> > > Well said, Sean.
>>> > >>>>> > > Sorry I made you keep around here since it might not be clearly stated. My bad.
>>> > >>>>> > >
>>> > >>>>> > > Yang, how could we ever tolerate the fact there are "other" occurrences of vendor names in the codebase? Please go and search "databricks" in the codebase and be surprised.
>>> > >>>>> > >
>>> > >>>>> > > If we believe that having vendor names in the codebase will increase the occurrence of making mistakes, why didn't we have a discussion thread earlier to remove all occurrences altogether? This is super tricky because I can even start to argue we have "Apple" as a vendor name in Apache Spark codebase. I'm not saying we use "apple" in the test data. See `isMacOnAppleSilicon` in Utils. Is it unavoidable? No, `isMacOnMSeries` or `isMacOnSilicon` is enough.
>>> > >>>>> > >
>>> > >>>>> > > We really need to draw a line where we disallow vendor names on it - if it's the entire codebase, I don't really think it is realistic.
>>> > >>>>> > >
>>> > >>>>> > > This was really a mistake, and it was definitely not from referring to the existing codebase. Not having a vendor name does not change anything on the chance of encountering this issue again. If we really care, we should think about style checking, which is the only viable way to catch the mistake. Again, I'd argue we have to have a bunch of vendor names in that style check, not just the problematic vendor name.
>>> > >>>>> > >
>>> > >>>>> > > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <sro...@gmail.com> wrote:
>>> > >>>>> > >
>>> > >>>>> > >> Doesn't the migration code 'clear' the debt?
>>> > >>>>> > >> The proposal is not to continue to support the config.
>>> > >>>>> > >> I feel like people are not quite understanding the change, and objecting to something that doesn't exist.
>>> > >>>>> > >> It's a shame, as this seems like something not even worth discussing. I don't know why this triggered this much discussion. We have kept deprecated methods without blinking, which is in comparison much bigger.
>>> > >>>>> > >> Can we maybe ask you review the actual change in question?
>>> > >>>>> > >>
>>> > >>>>> > >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <yangji...@apache.org> wrote:
>>> > >>>>> > >>
>>> > >>>>> > >>> -1
>>> > >>>>> > >>> Remove migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.0 because I think this configuration was initially introduced accidentally in Spark 3.5.4, lacking a clear design intent. Although the immediate maintenance cost of retaining this configuration currently seems limited, as subsequent versions iterate and user habits form, it may lead to the continuous accumulation of technical debt. When users come to view this configuration as one that can be relied on long-term, future removal may face greater resistance from users and could potentially become an entrenched and redundant configuration in the codebase.
>>> > >>>>> > >>> Therefore, promptly correcting this historically accidental configuration not only maintains the normativity of the Spark configuration system but also prevents unintended configurations from becoming de facto standards, thereby reducing long-term maintenance risks.
>>> > >>>>> > >>>
>>> > >>>>> > >>> Jie Yang
>>> > >>>>> > >>>
>>> > >>>>> > >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote:
>>> > >>>>> > >>> > -1 because there exists a feasible migration path for Apache Spark 3.5.4 via Apache Spark 3.5.5.
>>> > >>>>> > >>> >
>>> > >>>>> > >>> > It's obvious that this Databricks' mistake already causes a huge communication cost in the Apache Spark community and is suggesting a burden to enforce us to handle at least two more PRs at 4.0.0 and 4.1.0.
>>> > >>>>> > >>> >
>>> > >>>>> > >>> > Given that, I don't think
>>> > >>>>> > >>> > - This is an inevitable or
>>> > >>>>> > >>> > - This is 0 cost
>>> > >>>>> > >>> >
>>> > >>>>> > >>> > Dongjoon.
>>> > >>>>> > >>> >
>>> > >>>>> > >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote:
>>> > >>>>> > >>> > > Starting from my +1 (non-binding).
>>> > >>>>> > >>> > >
>>> > >>>>> > >>> > > In addition, I propose to retain migration logic till Spark 4.1.x and remove it in Spark 4.2.0.
>>> > >>>>> > >>> > >
>>> > >>>>> > >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>> > >>>>> > >>> > >
>>> > >>>>> > >>> > > > Hi dev,
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > Please vote to retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x.
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > - DISCUSSION: https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr ([DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+)
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > Specifically, please review this post https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k which explains pros and cons about the proposal - proposal is about "Option 1".
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > Simply speaking, this vote is to allow streaming queries which had been ever run in Spark 3.5.4 to be upgraded with Spark 4.0.x, "without having to be upgraded with Spark 3.5.5+ in prior". If the vote passes, we will help users to have a smooth upgrade from Spark 3.5.4 to Spark 4.0.x, which would be almost 1 year.
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > The (only) cons in this option is having to retain the incorrect configuration name as "string" in the codebase a bit longer. The code complexity of migration logic is arguably trivial.
>>> > >>>>> > >>> > > > (link <https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183>)
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > This VOTE is for Spark 4.0.x, but if someone supports including migration logic to be longer than Spark 4.0.x, please cast +1 here and leave the desired last minor version of Spark to retain this migration logic.
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > The vote is open for the next 72 hours and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > [ ] +1 Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x
>>> > >>>>> > >>> > > > [ ] -1 Remove migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.0 because...
>>> > >>>>> > >>> > > >
>>> > >>>>> > >>> > > > Thanks!
>>> > >>>>> > >>> > > > Jungtaek Lim (HeartSaVioR)
>>> > >>>>> > >>> >
>>> > >>>>> > >>> > ---------------------------------------------------------------------
>>> > >>>>> > >>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >>>>
>>> > >>>> --
>>> > >>>> Adam Binford