Dongjoon, I'm now OK with whatever you think, but I'd argue your vote is technically moot, since the question at hand is the justification for your vote, and I have no binding vote to counter yours. Let's be fair.
On Sun, Mar 16, 2025 at 3:07 PM Dongjoon Hyun <dongj...@apache.org> wrote:

> Thank you for focusing on this, Mark.
>
> I also agree with you that this should be decided by the Apache Spark
> PMC, and I appreciate the effort to help us move forward in the Apache
> way.
>
> As you mentioned, there is no ASF policy. That's true.
>
> > I am not aware of any ASF policy that strictly forbids the mention
> > of a vendor in Apache code for any reason
>
> Let's imagine that the Apache Spark project starts to support the
> following existing vendor `spark.databricks.*` configs to help Spark
> users migrate (or offload) from the Databricks service to an open
> source Spark cluster easily.
>
> - spark.databricks.cluster.profile
> - spark.databricks.io.cache.enabled
> - spark.databricks.delta.optimizeWrite.enabled
> - spark.databricks.passthrough.enabled
>
> Some users or developers may claim that there is a clear and huge
> benefit. I must agree with them, because that is also true.
>
> However, is this the way the Apache Spark project aims to go? I cannot
> agree with that.
>
> It's very bad for the Apache Spark distribution to support the
> `spark.databricks.*` namespace, as Spark 3.5.4 does, because it
> misleads Apache Spark users by diluting the boundary between the
> Apache Spark distribution (and brand) and the commercial vendor
> products (and brand). Note that Apache Spark 3.5.5 and all future
> 3.5.x releases also support `spark.databricks.*` until April 2026
> (the end of life), because deprecation is neither a deletion nor a
> ban.
>
> The incident in 3.5.4 was something that should never have happened.
> It has already caused much confusion and, sadly, will cause more.
>
> The confusion is contagious, not only for the distribution but also
> for the source code. I guess:
> - The original Databricks contributor was confused about what he was
>   contributing, maybe.
> - The Apache Spark committer (Jungtaek) overlooked what we should not
>   approve, because the code resembles his internal company repo.
> - The downstream Apache Spark fork repositories consume the
>   `spark.databricks.*` namespace as Apache Spark's namespace.
>
> To me, it is even more misleading to dilute the boundary between
> Apache Spark code and the commercial vendor code.
>
> I have been working on this issue, and I consider this vote the last
> piece of the overall handling of the `spark.databricks.*` incident,
> because I believe we are establishing a new rule for the Apache Spark
> community. This will serve as a precedent for handling similar
> incidents in the future.
>
> Please let me re-summarize the past steps I took with the community:
>
> 1. Helping rename the conf via SPARK-51172 (by approving it)
>
> 2. Banning `spark.databricks.*` via SPARK-51173 (by adding the
> `configName` Scalastyle rule)
>
> 3. Leading the discussion thread
> "Deprecating and banning `spark.databricks.*` config from Apache Spark
> repository"
> https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd
>
> 4. Reaching the agreement to release Spark 3.5.5 early.
> [VOTE] Release Apache Spark 3.5.5 deprecating `spark.databricks.*`
> configuration
> https://lists.apache.org/thread/6nn76olr65b8zfgzdcbtr9f6o98451o5
>
> 5. Releasing 3.5.5 as the release manager to provide a candidate
> migration path
>
> 6. Proposing that 3.5.4 users adopt 3.5.5 as the migration path
>
> I proposed documenting this in the Spark 4.0 migration guide (Step 6)
> because that is the only way to handle this incident without using
> the specific vendor config name again in the Spark code of the
> `master` and `branch-4.0` branches.
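[An aside for readers in the archives: the "automatic way" mentioned in
Step 2 above, and again just below, is a build-time ban. The actual
SPARK-51173 change is a Scalastyle regex rule in scalastyle-config.xml,
which I haven't reproduced here; the following is only a rough,
self-contained Scala sketch of what such a ban enforces, with
illustrative names:

  import java.nio.file.{Files, Path, Paths}
  import scala.jdk.CollectionConverters._

  // Hypothetical stand-in for a `configName`-style lint rule: report
  // (and fail on) any Scala source line containing a banned vendor
  // config namespace.
  object BannedConfigCheck {
    private val banned = Seq("spark.databricks.")

    def violations(root: Path): Seq[(Path, Int, String)] =
      Files.walk(root).iterator().asScala
        .filter(_.toString.endsWith(".scala"))
        .flatMap { path =>
          Files.readAllLines(path).asScala.zipWithIndex.collect {
            case (line, i) if banned.exists(b => line.contains(b)) =>
              (path, i + 1, line.trim)
          }
        }.toSeq

    def main(args: Array[String]): Unit = {
      val found = violations(Paths.get(args.headOption.getOrElse(".")))
      found.foreach { case (file, n, text) =>
        println(s"$file:$n: banned config namespace: $text")
      }
      if (found.nonEmpty) sys.exit(1) // a non-zero exit fails the build
    }
  }

The point of the automatic approach is that the ban is enforced
mechanically on every build, rather than relying on reviewers to catch
the namespace by eye.]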
> As you read in Step 2 above, I prefer the automatic way. The
> documentation-only solution was never my personal preference. It was
> the lesser of two evils.
>
> Let me reiterate this. Although we succeeded in deprecating the
> configuration early, the contaminated release branch `branch-3.5` and
> its releases will still accept the configuration in Spark jobs until
> April 2026 (the end of life). It is a long-standing, live incident
> that is still happening.
>
> For the vote, "Retain ... in Spark 4.0.x", I cast -1 because it aims
> to introduce the vendor configuration name (string) back into the
> Apache Spark 4 code. That means another contaminated branch,
> `branch-4.0`, will blur the boundary.
>
> On top of that, the Databricks Apache Spark committer (Jungtaek), who
> caused this incident by merging the `spark.databricks.*` code, set a
> trap in this vote by writing the following when he initiated it.
>
> > if someone supports including migration logic to be longer than
> > Spark 4.0.x, please cast +1 here and leave the desired last minor
> > version of Spark to retain this migration logic.
>
> At the same time, he cast +1 with the following.
>
> > Starting from my +1 (non-binding).
> > In addition, I propose to retain migration logic till Spark 4.1.x
> > and remove it in Spark 4.2.0.
>
> In the open source community, he is playing his own card trick,
> flipping the vote title under everyone's nose like a magician.
>
> > [VOTE] Retain migration logic of incorrect `spark.databricks.*`
> > config in Spark 4.0.x
>
> > [VOTE] Retain migration logic of incorrect `spark.databricks.*`
> > config in Spark 4.0.x/4.1.x
>
> In other words, Jungtaek is trying to extend this terrible and
> misleading situation until the end of life of Spark 4.1.0 (Spring
> 2027), for now. I expect he will extend it again, ignoring the
> promised removal at Spark 4.2+, with the same reasons as before:
> - We usually don't introduce breaking behavior under the same major
>   version.
> - The maintenance cost is near zero.
> In that case, it will be permanent under Spark 4 (~2030?).
>
> Of course, someone might say that it's better than the `branch-3.5`
> situation because the migration code is read-only support. However,
> it is still in the same category: it misleads the community into the
> confusion that Apache Spark supports `spark.databricks.*`
> configurations.
>
> The vote was articulated to cause a longer and bigger side effect,
> because `branch-4.0` and `branch-4.1` cover a longer period and more
> releases in total. To prevent an outbreak of contagious
> `spark.databricks.*` situations, we should stop now and protect
> `branch-4.0`. The side effect and its implications are huge.
>
> Apache Spark 4.0.0 is the only version at which we can stop this from
> spreading. So the documentation-only approach is the only feasible
> choice.
>
> So, -1 (= the technical justification for the veto is valid).
>
> Sincerely,
> Dongjoon.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
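[A second aside: "migration logic" here means read-only fallback from
the incorrect vendor key to the renamed key. If I recall correctly,
Spark's internal ConfigBuilder expresses this with an alternative-key
mechanism; the following is only a minimal, self-contained sketch of
the shape, with hypothetical key names rather than the actual
SPARK-51172 entries:

  // Prefer the canonical key; fall back to the deprecated vendor key,
  // with a warning, if only the old key is set. Removing the map entry
  // is what "removal in 4.2.0" would mean for users.
  object ConfMigration {
    private val renamedFrom = Map(
      "spark.sql.example.flag" -> "spark.databricks.example.flag"
    )

    def get(settings: Map[String, String], key: String): Option[String] =
      settings.get(key).orElse {
        renamedFrom.get(key).flatMap { oldKey =>
          settings.get(oldKey).map { value =>
            Console.err.println(
              s"WARNING: '$oldKey' is deprecated; use '$key' instead.")
            value
          }
        }
      }

    def main(args: Array[String]): Unit = {
      val jobConf = Map("spark.databricks.example.flag" -> "true")
      // Resolves via the deprecated key and emits the warning.
      println(ConfMigration.get(jobConf, "spark.sql.example.flag"))
    }
  }

Whichever way the vote lands, the dispute above is only about how long
that fallback entry is retained, not about write support for the
vendor namespace.]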