Thank you for focusing on this, Mark. I also agree with you that this should be decided by the Apache Spark PMC and appreciate the effort to help us move forward in the Apache way.
As you mentioned, there is no ASF policy. That's true.

> I am not aware of any ASF policy that strictly forbids the mention of a vendor
> in Apache code for any reason

Let's imagine that the Apache Spark project started to support the following existing vendor `spark.databricks.*` configs to help Spark users migrate (or offload) from the Databricks service to an open source Spark cluster easily.

- spark.databricks.cluster.profile
- spark.databricks.io.cache.enabled
- spark.databricks.delta.optimizeWrite.enabled
- spark.databricks.passthrough.enabled

Some users or developers may claim that there is a clear and huge benefit. I must agree with them because that is also true. However, is this the direction in which the Apache Spark project aims to go? I cannot agree with that. It is very bad for the Apache Spark distribution to support the `spark.databricks.*` namespace, as Spark 3.5.4 does, because it misleads Apache Spark users by diluting the boundary between the Apache Spark distribution (and brand) and commercial vendor products (and brands). Note that Apache Spark 3.5.5 and all future 3.5.x releases also support `spark.databricks.*` until April 2026 (the end-of-life) because deprecation is neither a deletion nor a ban.

The incident in 3.5.4 is something that should never have happened. It has already caused a lot of confusion and, sadly, will cause more. The confusion is contagious, not only for the distribution but also for the source code. I guess:

- The original Databricks contributor was perhaps confused about what he was contributing.
- The Apache Spark committer (Jungtaek) overlooked what we should not have approved because the code resembles his internal company repo.
- Downstream Apache Spark fork repositories consume the `spark.databricks.*` namespace as if it were Apache Spark's namespace.

For me, it is even more misleading to dilute the boundary between Apache Spark code and commercial vendor code.

I have been working on this issue and consider this vote the last piece of the overall handling of the `spark.databricks.*` incident because I believe we are establishing a new rule for the Apache Spark community. This will serve as a precedent for handling similar incidents in the future. Please let me re-summarize the past steps I took with the community:

1. Helped rename the conf via SPARK-51172 (by approving it).
2. Banned `spark.databricks.*` via SPARK-51173 (by adding a `configName` Scalastyle rule).
3. Led the discussion thread "Deprecating and banning `spark.databricks.*` config from Apache Spark repository": https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd
4. Reached the agreement to release Spark 3.5.5 early: "[VOTE] Release Apache Spark 3.5.5 deprecating `spark.databricks.*` configuration" https://lists.apache.org/thread/6nn76olr65b8zfgzdcbtr9f6o98451o5
5. Released 3.5.5 as the release manager to provide a candidate migration path.
6. Proposed that 3.5.4 users use 3.5.5 as the migration path.

I proposed documenting this in the Spark 4.0 migration guide (Step 6) because that is the only way to handle this incident without using the specific vendor config name again in the Spark code on the `master` and `branch-4.0` branches. As you can see in Step 2 above, I prefer the automatic way; the documentation-only solution was never my personal preference. It was the lesser of two evils.

Let me reiterate this: although we succeeded in deprecating the configuration early, the contaminated release branch `branch-3.5` and its releases still support the configuration for Spark jobs until April 2026 (the end-of-life).
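To make the technical stake concrete, here is a minimal, self-contained Scala sketch of what such "read-only migration logic" amounts to. This is an illustration only, not the actual Spark implementation; the `ConfMigration` object, the `migrateConf` helper, and the key names are all hypothetical.

```scala
// Hypothetical sketch only -- NOT the actual Apache Spark code.
// A deprecated vendor-named key is accepted on read and silently
// redirected to its canonical spark.* name, with a warning.
object ConfMigration {
  // Invented example mapping; the real rename was done in SPARK-51172.
  private val renamedKeys: Map[String, String] = Map(
    "spark.databricks.example.feature.enabled" ->
      "spark.sql.example.feature.enabled")

  /** Rewrites deprecated keys to their canonical names, warning on each hit. */
  def migrateConf(userConf: Map[String, String]): Map[String, String] =
    userConf.map { case (key, value) =>
      renamedKeys.get(key) match {
        case Some(canonical) =>
          Console.err.println(
            s"WARN: '$key' is deprecated; use '$canonical' instead.")
          canonical -> value
        case None =>
          key -> value
      }
    }

  def main(args: Array[String]): Unit = {
    val migrated = migrateConf(
      Map("spark.databricks.example.feature.enabled" -> "true"))
    println(migrated) // Map(spark.sql.example.feature.enabled -> true)
  }
}
```

As long as a redirect like this ships in a release branch, every job that sets the vendor-named key keeps working, which is exactly why deprecation alone does not remove the namespace from Spark's user-facing surface.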
This is a long-standing, live incident that is still happening.

For the vote, "Retain ... in Spark 4.0.x", I cast -1 because it aims to introduce the vendor configuration name (string) back into the Apache Spark 4 code. That means another contaminated branch, `branch-4.0`, will blur the boundary. On top of that, the Apache Spark committer from Databricks (Jungtaek), who caused this incident by merging the `spark.databricks.*` code, set a trap in this vote by writing the following when he initiated it:

> if someone supports including migration logic to be longer than Spark 4.0.x,
> please cast +1 here and leave the desired last minor version of Spark to
> retain this migration logic.

At the same time, he cast +1 with the following:

> Starting from my +1 (non-binding).
> In addition, I propose to retain migration logic till Spark 4.1.x and remove
> it in Spark 4.2.0.

In the open source community, he is playing his own card trick, flipping the vote title under everyone's nose like magic:

> [VOTE] Retain migration logic of incorrect `spark.databricks.*` config in
> Spark 4.0.x

> [VOTE] Retain migration logic of incorrect `spark.databricks.*` config in
> Spark 4.0.x/4.1.x

In other words, Jungtaek is trying to spread this terrible and misleading situation to the end of life of Spark 4.1.x (Spring 2027) for now. I expect that he will extend it again, ignoring the removal in Spark 4.2+, with the same kinds of reasons:

- We usually don't introduce breaking behavior under the same major version.
- The maintenance cost is near zero.

In that case, it will be permanent under Spark 4 (~2030?).

Of course, someone might say that this is better than the `branch-3.5` situation because the migration code is read-only support. However, it is still in the same category, misleading the community into the confusion that Apache Spark supports `spark.databricks.*` configurations. The vote was framed to cause a longer and bigger side effect because `branch-4.0` and `branch-4.1` together cover a longer period and many more releases in total.

To prevent the outbreak of contagious `spark.databricks.*` situations, we should stop now and protect `branch-4.0`. The side effect and its implications are huge. Apache Spark 4.0.0 is the only version at which we can stop this spread, so documentation-only is the only feasible way to choose.

So, -1 (= the technical justification for the veto is valid).

Sincerely,
Dongjoon.