> > this vote is to allow streaming queries which had been ever run in Spark > 3.5.4 to be upgraded with Spark 4.0.x, "without having to be upgraded with > Spark 3.5.5+ in prior".
In the history of Apache Spark, have we ever required users to upgrade to the next maintenance release before moving to a new feature or major release? Xiao Adam Binford <adam...@gmail.com> 于2025年3月11日周二 09:08写道: > +1 (non-binding) > > It's a pretty in the weeds issue with how Structured Streaming works under > the hood that's kinda hard to understand if you're not familiar with it. > The migration logic doesn't mean users can still use the old config, it's > purely behind the scenes to fix checkpoint metadata in streams created in > 3.5.4. The 5 lines of code it takes to address a weird edge case for > certain users that's already gone from master shouldn't be a huge deal. > > On Tue, Mar 11, 2025 at 1:43 AM Yang Jie <yangji...@apache.org> wrote: > >> >> To Sean, you're right, I'm very sorry. >> >> From the perspective of compatibility and migratability, I think we >> should migrate this logic to 4.0.0 and keep it in the codebase for a longer >> time (or permanently), because we can't predict which version users of >> 3.5.4 will choose next. >> >> >> I don't want to discuss the so-called vendor issue. >> >> I withdraw my previous -1. >> >> Jie Yang. >> >> On 2025/03/11 04:42:25 Wenchen Fan wrote: >> > Guys, let’s be honest about what we’re discussing here. >> > >> > If this is a migration issue, why would we even need a vote? We’ve been >> > consistently adding configurations to restore legacy behavior instead of >> > removing them because we understand the challenges of upgrading Spark >> > versions. Our goal has always been to make upgrades easier, even if it >> > means carrying some technical debt. I don’t think we want to change that >> > culture now. >> > >> > If the concern is about vendor names appearing in the codebase, then >> why is >> > it a big deal this time when vendor names are already present >> elsewhere? If >> > we’ve failed to follow a policy, let’s correct it, but can someone >> point to >> > the specific policy we’re violating? >> > >> > If the vote is about adding migration logic to ease the upgrade from >> 3.5.4 >> > to 4.0.0, then +1, why not? >> > >> > Thanks, >> > Wenchen >> > >> > >> > >> > On Mon, Mar 10, 2025 at 8:49 PM Jungtaek Lim < >> kabhwan.opensou...@gmail.com> >> > wrote: >> > >> > > Well said, Sean. Sorry I made you keep around here since it might not >> be >> > > clearly stated. My bad. >> > > >> > > Yang, how could we ever tolerate the fact there are "other" >> occurrences of >> > > vendor names in the codebase? Please go and search "databricks" in the >> > > codebase and be surprised. >> > > >> > > If we believe that having vendor names in the codebase will increase >> > > the occurrence of making mistakes, why didn't we have a discussion >> thread >> > > earlier to remove all occurrences altogether? This is super tricky >> because >> > > I can even start to argue we have "Apple" as a vendor name in Apache >> Spark >> > > codebase. I'm not saying we use "apple" in the test data. See >> > > `isMacOnAppleSilicon` in Utils. Is it unavoidable? No, >> `isMacOnMSeries` or >> > > `isMacOnSilicon` is enough. >> > > >> > > We really need to draw a line where we disallow vendor names on it - >> if >> > > it's the entire codebase, I don't really think it is realistic. >> > > >> > > This was really a mistake, and it was definitely not from referring >> to the >> > > existing codebase. Not having a vendor name does not change anything >> on the >> > > chance of encountering this issue again. If we really care, we should >> think >> > > about style checking, which is the only viable way to catch the >> mistake. >> > > Again, I'd argue we have to have a bunch of vendor names in that style >> > > check, not just the problematic vendor name. >> > > >> > > >> > > On Tue, Mar 11, 2025 at 12:17 PM Sean Owen <sro...@gmail.com> wrote: >> > > >> > >> Doesn't the migration code 'clear' the debt? >> > >> The proposal is not to continue to support the config. >> > >> I feel like people are not quite understanding the change, and >> objecting >> > >> to something that doesn't exist. >> > >> It's a shame, as this seems like something not even worth >> discussing. I >> > >> don't know why this triggered this much discussion. We have kept >> deprecated >> > >> methods without blinking, which is in comparison much bigger. >> > >> Can we maybe ask you review the actual change in question? >> > >> >> > >> On Mon, Mar 10, 2025, 10:02 PM Yang Jie <yangji...@apache.org> >> wrote: >> > >> >> > >>> -1 >> > >>> Remove migration logic of incorrect `spark.databricks.*` >> configuration >> > >>> in Spark 4.0.0 because I think this configuration was initially >> introduced >> > >>> accidentally in Spark 3.5.4, lacking a clear design intent. >> Although the >> > >>> immediate maintenance cost of retaining this configuration >> currently seems >> > >>> limited, as subsequent versions iterate and user habits form, it >> may lead >> > >>> to the continuous accumulation of technical debt. When users come >> to view >> > >>> this configuration as one that can be relied on long-term, future >> removal >> > >>> may face greater resistance from users and could potentially become >> an >> > >>> entrenched and redundant configuration in the codebase. Therefore, >> promptly >> > >>> correcting this historically accidental configuration not only >> maintains >> > >>> the normativity of the Spark configuration system but also prevents >> > >>> unintended configurations from becoming de facto standards, thereby >> > >>> reducing long-term maintenance risks. >> > >>> >> > >>> Jie Yang >> > >>> >> > >>> On 2025/03/10 14:52:52 Dongjoon Hyun wrote: >> > >>> > -1 because there exists a feasible migration path for Apache Spark >> > >>> 3.5.4 via Apache Spark 3.5.5. >> > >>> > >> > >>> > It's obvious that this Databricks' mistake already causes a huge >> > >>> communication cost in the Apache Spark community and is suggesting >> a burden >> > >>> to enforce us to handle at least two more PRs at 4.0.0 and 4.1.0. >> > >>> > >> > >>> > Given that, I don't think >> > >>> > - This is an inevitable or >> > >>> > - This is 0 cost >> > >>> > >> > >>> > Dongjoon. >> > >>> > >> > >>> > On 2025/03/10 12:46:16 Jungtaek Lim wrote: >> > >>> > > Starting from my +1 (non-binding). >> > >>> > > >> > >>> > > In addition, I propose to retain migration logic till Spark >> 4.1.x and >> > >>> > > remove it in Spark 4.2.0. >> > >>> > > >> > >>> > > On Mon, Mar 10, 2025 at 9:44 PM Jungtaek Lim < >> > >>> kabhwan.opensou...@gmail.com> >> > >>> > > wrote: >> > >>> > > >> > >>> > > > Hi dev, >> > >>> > > > >> > >>> > > > Please vote to retain migration logic of incorrect >> > >>> `spark.databricks.*` >> > >>> > > > configuration in Spark 4.0.x. >> > >>> > > > >> > >>> > > > - DISCUSSION: >> > >>> > > > >> https://lists.apache.org/thread/xzk9729lsmo397crdtk14f74g8cyv4sr >> > >>> > > > ([DISCUSS] Handling spark.databricks.* config being exposed in >> > >>> 3.5.4 in >> > >>> > > > Spark 4.0.0+) >> > >>> > > > >> > >>> > > > Specifically, please review this post >> > >>> > > > >> https://lists.apache.org/thread/xtq1kjhsl4ohfon78z3wld2hmfm78t9k >> > >>> which >> > >>> > > > explains pros and cons about the proposal - proposal is about >> > >>> "Option 1". >> > >>> > > > >> > >>> > > > Simply speaking, this vote is to allow streaming queries >> which had >> > >>> been >> > >>> > > > ever run in Spark 3.5.4 to be upgraded with Spark 4.0.x, >> "without >> > >>> having to >> > >>> > > > be upgraded with Spark 3.5.5+ in prior". If the vote passes, >> we >> > >>> will help >> > >>> > > > users to have a smooth upgrade from Spark 3.5.4 to Spark >> 4.0.x, >> > >>> which would >> > >>> > > > be almost 1 year. >> > >>> > > > >> > >>> > > > The (only) cons in this option is having to retain the >> incorrect >> > >>> > > > configuration name as "string" in the codebase a bit longer. >> The >> > >>> code >> > >>> > > > complexity of migration logic is arguably trivial. (link >> > >>> > > > < >> > >>> >> https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183 >> > >>> > >> > >>> > > > ) >> > >>> > > > >> > >>> > > > This VOTE is for Spark 4.0.x, but if someone supports >> including >> > >>> migration >> > >>> > > > logic to be longer than Spark 4.0.x, please cast +1 here and >> leave >> > >>> the >> > >>> > > > desired last minor version of Spark to retain this migration >> logic. >> > >>> > > > >> > >>> > > > The vote is open for the next 72 hours and passes if a >> majority +1 >> > >>> PMC >> > >>> > > > votes are cast, with a minimum of 3 +1 votes. >> > >>> > > > >> > >>> > > > [ ] +1 Retain migration logic of incorrect >> `spark.databricks.*` >> > >>> > > > configuration in Spark 4.0.x >> > >>> > > > [ ] -1 Remove migration logic of incorrect >> `spark.databricks.*` >> > >>> > > > configuration in Spark 4.0.0 because... >> > >>> > > > >> > >>> > > > Thanks! >> > >>> > > > Jungtaek Lim (HeartSaVioR) >> > >>> > > > >> > >>> > > >> > >>> > >> > >>> > >> --------------------------------------------------------------------- >> > >>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> > >>> > >> > >>> > >> > >>> >> > >>> >> --------------------------------------------------------------------- >> > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> > >>> >> > >>> >> > >> >> --------------------------------------------------------------------- >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> > > -- > Adam Binford >