Just one point: I believe I have said I would be willing to hear about supporting the migration logic in a later release. The scope of the VOTE is nothing beyond 4.0.x. Please do not interpret it in your own way. I keep being attacked for things I am not saying, and then I have to prove that I never said them. Please stop it.
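For concreteness, the "migration logic" in question is only a read-time
fallback from the mistakenly named key to the corrected one. Below is a
minimal, self-contained sketch with hypothetical placeholder key names
(not the actual entry from SPARK-51172); inside Spark itself this is what
`ConfigBuilder.withAlternative` provides, if I recall correctly.

  object ConfigMigrationSketch {
    val NewKey    = "spark.sql.example.flag"            // hypothetical corrected key
    val LegacyKey = "spark.databricks.sql.example.flag" // hypothetical incorrect key

    // Prefer the corrected key; fall back to the legacy one if only it is set.
    // Deleting the orElse line is exactly what "removing the migration logic" means.
    def resolve(conf: Map[String, String], default: Boolean = false): Boolean =
      conf.get(NewKey)
        .orElse(conf.get(LegacyKey))
        .map(_.toBoolean)
        .getOrElse(default)

    def main(args: Array[String]): Unit = {
      assert(resolve(Map(LegacyKey -> "true")))  // honored while the fallback exists
      assert(!resolve(Map.empty))                // only the default once it is gone
    }
  }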
On Sun, Mar 16, 2025 at 3:14 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Dongjoon,
>
> I'm now OK with whatever you think, but I'd argue your vote is technically
> moot since it is about your own vote justification, and I have no binding
> vote to counter you. Let's be fair.
>
> On Sun, Mar 16, 2025 at 3:07 PM Dongjoon Hyun <dongj...@apache.org> wrote:
>
>> Thank you for focusing on this, Mark.
>>
>> I also agree with you that this should be decided by the Apache Spark PMC,
>> and I appreciate the effort to help us move forward in the Apache way.
>>
>> As you mentioned, there is no ASF policy. That's true.
>>
>> > I am not aware of any ASF policy that strictly forbids the mention of a vendor
>> > in Apache code for any reason
>>
>> Let's imagine that the Apache Spark project starts to support the following
>> existing vendor `spark.databricks.*` configs to help Spark users migrate
>> (or offload) from the Databricks service to an open source Spark cluster
>> easily.
>>
>> - spark.databricks.cluster.profile
>> - spark.databricks.io.cache.enabled
>> - spark.databricks.delta.optimizeWrite.enabled
>> - spark.databricks.passthrough.enabled
>>
>> Some users or developers may claim that there is a clear and huge benefit.
>> I must agree with them, because that is also true.
>>
>> However, is this the way the Apache Spark project aims to go? I cannot
>> agree with that.
>>
>> It is very bad for the Apache Spark distribution to support the
>> `spark.databricks.*` namespace, as Spark 3.5.4 does, because it misleads
>> Apache Spark users by diluting the boundary between the Apache Spark
>> distribution (and brand) and commercial vendor products (and brands).
>> Note that Apache Spark 3.5.5 and all future 3.5.x releases also support
>> `spark.databricks.*` until April 2026 (the end-of-life), because a
>> deprecation is neither a deletion nor a ban.
>>
>> The incident in 3.5.4 is something that should never have happened. It
>> has already caused much confusion and will sadly cause more.
>>
>> The confusion is contagious, not only for the distribution but also for
>> the source code. I guess:
>> - The original Databricks contributor was perhaps confused about what he
>>   was contributing.
>> - The Apache Spark committer (Jungtaek) overlooked what we should not
>>   have approved, because the code resembles his internal company repo.
>> - The downstream Apache Spark fork repositories consume the
>>   `spark.databricks.*` namespace as Apache Spark's namespace.
>>
>> For me, it is even more misleading to dilute the boundary between Apache
>> Spark code and commercial vendor code.
>>
>> I have been working on this issue and consider this vote the last piece
>> of the overall handling of the `spark.databricks.*` incident, because I
>> believe we are establishing a new rule for the Apache Spark community.
>> This will serve as a precedent for handling similar incidents in the
>> future.
>>
>> Please let me re-summarize the steps I have taken with the community:
>>
>> 1. Helping rename the conf via SPARK-51172 (by approving it)
>>
>> 2. Banning `spark.databricks.*` via SPARK-51173 (by adding a `configName`
>>    Scalastyle rule; see the sketch right after this list)
>>
>> 3. Leading the discussion thread
>>    "Deprecating and banning `spark.databricks.*` config from Apache Spark repository"
>>    https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd
>>
>> 4. Reaching the agreement to release Spark 3.5.5 early
>>    [VOTE] Release Apache Spark 3.5.5 deprecating `spark.databricks.*` configuration
>>    https://lists.apache.org/thread/6nn76olr65b8zfgzdcbtr9f6o98451o5
>>
>> 5. Releasing 3.5.5 as a release manager to provide a candidate migration path
>>
>> 6. Proposing for 3.5.4 users to use 3.5.5 as the migration path
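>>
>> For concreteness, the `configName` rule in Step 2 boils down to rejecting
>> any source line that hard-codes the vendor namespace. Here is a rough,
>> plain-Scala sketch of the idea only; the actual rule added by SPARK-51173
>> is a regex entry in scalastyle-config.xml, not this code:
>>
>>   object ConfigNameCheckSketch {
>>     // The namespace the style rule forbids in Apache Spark sources.
>>     private val Banned = raw"spark\.databricks\.".r
>>
>>     // Return the 1-based numbers of the lines that violate the rule.
>>     def violations(source: String): Seq[Int] =
>>       source.linesIterator.zipWithIndex.collect {
>>         case (line, i) if Banned.findFirstIn(line).isDefined => i + 1
>>       }.toSeq
>>   }
>>
>> A build that wires such a check into CI fails fast whenever the banned
>> namespace reappears, which is why I call this the automatic way.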
>>
>> I proposed documenting this in the migration guide of Spark 4.0 (Step 6)
>> because that is the only way to handle this incident without using the
>> specific vendor config name again in the Spark code of the `master` and
>> `branch-4.0` branches. As you can read in Step 2 above, I prefer the
>> automatic way. The documentation-only solution was never my personal
>> preference. It was the lesser of two evils.
>>
>> Let me reiterate this. Although we succeeded in deprecating the
>> configuration early, the contaminated release branch `branch-3.5` and its
>> releases still honor the configuration in Spark jobs until April 2026
>> (the end-of-life). It is a long-standing, live incident which is still
>> happening now.
>>
>> For the vote, "Retain ... in Spark 4.0.x", I cast -1 because it aims to
>> introduce the vendor configuration name (string) back into the Apache
>> Spark 4 code. It means another contaminated branch, `branch-4.0`, will
>> blur the boundary.
>>
>> On top of that, the Databricks Apache Spark committer (Jungtaek), who
>> caused this incident by merging the `spark.databricks.*` code, set a trap
>> in this vote by writing the following when he initiated it.
>>
>> > if someone supports including migration logic to be longer than Spark
>> 4.0.x, please cast +1 here and leave the desired last minor version of
>> Spark to retain this migration logic.
>>
>> At the same time, he cast +1 with the following.
>>
>> > Starting from my +1 (non-binding).
>> > In addition, I propose to retain migration logic till Spark 4.1.x and
>> remove it in Spark 4.2.0.
>>
>> In the open source community, he is playing his own card trick by
>> flipping the vote title under everyone's nose like a magic trick.
>>
>> > [VOTE] Retain migration logic of incorrect `spark.databricks.*` config
>> in Spark 4.0.x
>> > [VOTE] Retain migration logic of incorrect `spark.databricks.*` config
>> in Spark 4.0.x/4.1.x
>>
>> In other words, Jungtaek is trying to extend this terrible and misleading
>> situation to the end of life of Spark 4.1.0 (Spring 2027) for now. I
>> expect he will extend it again, ignoring the removal in Spark 4.2+, with
>> the same reasons as the following.
>> - We usually don't introduce breaking behavior within the same major version.
>> - The maintenance cost is near zero.
>> In that case, it would become permanent under Spark 4 (~ 2030?).
>>
>> Of course, someone might say that it is better than the `branch-3.5`
>> situation because the migration code is read-only support. However, it is
>> still in the same category, misleading the community into the confusion
>> that Apache Spark supports `spark.databricks.*` configurations.
>>
>> The vote was articulated to cause a longer and bigger side effect,
>> because `branch-4.0` and `branch-4.1` have a longer lifetime and more
>> releases in total. To prevent the outbreak of contagious
>> `spark.databricks.*` situations, we should stop now and protect
>> `branch-4.0`. The side effects and implications are huge.
>>
>> Apache Spark 4.0.0 is the only version at which we can stop this from
>> spreading, so documentation-only is the only feasible way to choose.
>>
>> So, -1 (= The technical justification for the veto is valid)
>>
>> Sincerely,
>> Dongjoon.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>