Dongjoon,

I'm now OK with whatever you think, but I would argue your -1 here is
technically moot, since this vote is about the justification of your own
veto, and I have no binding vote to counter you. Let's be fair.

On Sun, Mar 16, 2025 at 3:07 PM Dongjoon Hyun <dongj...@apache.org> wrote:

> Thank you for focusing on this, Mark.
>
> I also agree with you that this should be decided by the Apache Spark PMC
> and appreciate the effort to help us move forward in the Apache way.
>
> As you mentioned, there is no ASF policy. That's true.
>
> > I am not aware of any ASF policy that strictly forbids the mention of a
> vendor
> > in Apache code for any reason
>
> Let's imagine that the Apache Spark project started to support the
> following existing vendor `spark.databricks.*` configs to help Spark users
> migrate (or offload) from the Databricks service to an open source Spark
> cluster easily.
>
> - spark.databricks.cluster.profile
> - spark.databricks.io.cache.enabled
> - spark.databricks.delta.optimizeWrite.enabled
> - spark.databricks.passthrough.enabled
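>
> For illustration, "supporting" such a config would mean wiring the vendor
> name into Spark's own config registry. A hypothetical sketch (the entry
> and default below are made up; `withAlternative` is the existing alias
> mechanism in Spark's internal `ConfigBuilder`):
>
>   import org.apache.spark.internal.config.ConfigBuilder
>
>   // Hypothetical: an Apache Spark config that silently honors a
>   // vendor-namespaced key as an alias. Exactly the kind of wiring
>   // the Apache Spark code base should not carry.
>   private[spark] val IO_CACHE_ENABLED =
>     ConfigBuilder("spark.io.cache.enabled")
>       .withAlternative("spark.databricks.io.cache.enabled")
>       .booleanConf
>       .createWithDefault(false)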
>
> Some users or developers may claim that there is a clear and huge benefit.
> I would have to agree with them, because that is also true.
>
> However, is this the direction the Apache Spark project aims to go? I
> cannot agree with that.
>
> It's very bad for the Apache Spark distribution to support the
> `spark.databricks.*` namespace, as Spark 3.5.4 does, because it misleads
> Apache Spark users by diluting the boundary between the Apache Spark
> distribution (and brand) and the commercial vendor's products (and brand).
> Note that Apache Spark 3.5.5 and all future 3.5.x releases also support
> `spark.databricks.*` until April 2026 (the end of life), because
> deprecation is neither deletion nor a ban.
>
> The incident in 3.5.4 was something that should never have happened. It
> has already caused a lot of confusion and, sadly, will cause more.
>
> The confusion is contagious, affecting not only the distribution but also
> the source code. I would guess:
> - The original Databricks contributor was perhaps confused about what he
>   was contributing.
> - The Apache Spark committer (Jungtaek) overlooked what we should not
>   approve, because the code resembles his internal company repository.
> - The downstream Apache Spark fork repositories consume the
>   `spark.databricks.*` namespace as if it were Apache Spark's own.
>
> To me, it is even more misleading to dilute the boundary between Apache
> Spark code and commercial vendor code.
>
> I have been working on this issue and consider this vote the last piece
> of the overall handling of the `spark.databricks.*` incident, because I
> believe we are establishing a new rule for the Apache Spark community. It
> will serve as a precedent for handling similar incidents in the future.
>
> Please let me re-summarize the steps I have taken with the community:
>
> 1. Helped rename the conf via SPARK-51172 (by approving it)
>
> 2. Banned `spark.databricks.*` via SPARK-51173 (by adding a `configName`
> Scalastyle rule)
>
> 3. Led the discussion thread
> "Deprecating and banning `spark.databricks.*` config from Apache Spark
> repository"
> https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd
>
> 4. Reached agreement to release Spark 3.5.5 early:
> [VOTE] Release Apache Spark 3.5.5 deprecating `spark.databricks.*`
> configuration
> https://lists.apache.org/thread/6nn76olr65b8zfgzdcbtr9f6o98451o5
>
> 5. Released 3.5.5 as the release manager to provide a candidate migration
> path
>
> 6. Proposed that 3.5.4 users use 3.5.5 as the migration path
>
> I proposed documenting this in the migration guide of Spark 4.0 (Step 6)
> because that is the only way to handle this incident without using the
> specific vendor config name again in the Spark code of the `master` and
> `branch-4.0` branches. As you can read in Step 2 above, I prefer the
> automatic way. The documentation-only solution was never my personal
> preference; it was the lesser of two evils.
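>
> To make that "automatic way" concrete: the actual rule lives in
> scalastyle-config.xml, but its spirit is roughly the following sketch
> (illustrative only, not the real implementation):
>
>   // Illustrative sketch: fail the build when any source line mentions
>   // the banned vendor config namespace, so it cannot return silently.
>   val banned = raw"spark\.databricks\.".r
>
>   def checkLine(file: String, lineNo: Int, line: String): Unit =
>     require(banned.findFirstIn(line).isEmpty,
>       s"$file:$lineNo: the `spark.databricks.*` namespace is banned")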
>
> Let me reiterate this. Although we succeeded in deprecating the
> configuration early, the contaminated release branch `branch-3.5` and its
> releases still honor the configuration in Spark jobs until April 2026
> (the end of life). It is a long-standing, live incident that is still
> happening today.
>
> For the vote, "Retain ... in Spark 4.0.x", I cast -1 because it aims to
> reintroduce the vendor configuration name (string) into the Apache Spark 4
> code. That means another contaminated branch, `branch-4.0`, which will
> blur the boundary again.
>
> On top of that, the Databricks Apache Spark committer (Jungtaek), who
> caused this incident by merging the `spark.databricks.*` code, set a trap
> in this vote by writing the following when he initiated it.
>
> > if someone supports including migration logic to be longer than Spark
> 4.0.x, please cast +1 here and leave the desired last minor version of
> Spark to retain this migration logic.
>
> At the same time, he cast +1 with the following.
>
> > Starting from my +1 (non-binding).
> > In addition, I propose to retain migration logic till Spark 4.1.x and
> remove it in Spark 4.2.0.
>
> In the open source community, he is playing his own card trick, flipping
> the vote title under everyone's nose like magic.
>
> > [VOTE] Retain migration logic of incorrect `spark.databricks.*` config
> in Spark 4.0.x
> > [VOTE] Retain migration logic of incorrect `spark.databricks.*` config
> in Spark 4.0.x/4.1.x
>
> In other words, Jungtaek is trying to stretch this terrible and misleading
> situation to the end of life of Spark 4.1.x (spring 2027) for now. I
> expect he will extend it again, skipping the removal in Spark 4.2+, with
> reasons like the following:
> - We usually don't introduce breaking behavior within the same major
> version.
> - The maintenance cost is near zero.
> In that case, it would be permanent for all of Spark 4 (~2030?).
>
> Of course, someone might say that this is better than the `branch-3.5`
> situation because the migration code is read-only support. However, it
> still falls into the same category, misleading the community into the
> confusion that Apache Spark supports `spark.databricks.*` configurations.
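>
> To be precise about what "read-only support" means: the retained logic
> would translate the incorrect key only while reading metadata written by
> 3.5.4, never while writing. A hypothetical sketch (the key names reuse
> the illustrative examples above, not the actual renamed conf):
>
>   // Illustrative sketch: map the incorrectly named key from old
>   // metadata to its renamed Apache Spark counterpart on read.
>   def migrateKey(key: String): String = key match {
>     case "spark.databricks.io.cache.enabled" => "spark.io.cache.enabled"
>     case other => other
>   }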
>
> The vote was framed to cause a longer and bigger side effect, because
> `branch-4.0` and `branch-4.1` together cover a longer period and more
> releases in total. To prevent an outbreak of contagious
> `spark.databricks.*` situations, we should stop now and protect
> `branch-4.0`. The side effects and implications are huge.
>
> Apache Spark 4.0.0 is the only version at which we can stop this from
> spreading. So, documentation-only is the only feasible choice.
>
> So, -1 (= The technical justification for the veto is valid)
>
> Sincerely,
> Dongjoon.
>