Just one point: I believe I have said I would be willing to hear about supporting the migration logic in a later release. The scope of the VOTE is nothing beyond 4.0.x. Please do not interpret it in your own way. I keep being attacked for things I am not saying, and then I have to prove that I never said them. Please stop it.
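For concreteness, the "migration logic" in question is only a read-time
fallback from the mistakenly named key to the corrected one. Below is a
minimal, self-contained sketch with hypothetical placeholder key names
(not the actual entry from SPARK-51172); inside Spark itself this is what
`ConfigBuilder.withAlternative` provides, if I recall correctly.

  object ConfigMigrationSketch {
    val NewKey    = "spark.sql.example.flag"            // hypothetical corrected key
    val LegacyKey = "spark.databricks.sql.example.flag" // hypothetical incorrect key

    // Prefer the corrected key; fall back to the legacy one if only it is set.
    // Deleting the orElse line is exactly what "removing the migration logic" means.
    def resolve(conf: Map[String, String], default: Boolean = false): Boolean =
      conf.get(NewKey)
        .orElse(conf.get(LegacyKey))
        .map(_.toBoolean)
        .getOrElse(default)

    def main(args: Array[String]): Unit = {
      assert(resolve(Map(LegacyKey -> "true")))  // honored while the fallback exists
      assert(!resolve(Map.empty))                // only the default once it is gone
    }
  }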
On Sun, Mar 16, 2025 at 3:14 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Dongjoon,
>
> I'm now OK with whatever you think, but I'd argue your vote is technically
> moot since it is about your own vote justification, and I have no binding
> vote to counter you. Let's be fair.
>
> On Sun, Mar 16, 2025 at 3:07 PM Dongjoon Hyun <dongj...@apache.org> wrote:
>
>> Thank you for focusing on this, Mark.
>>
>> I also agree with you that this should be decided by the Apache Spark PMC,
>> and I appreciate the effort to help us move forward in the Apache way.
>>
>> As you mentioned, there is no ASF policy. That's true.
>>
>> > I am not aware of any ASF policy that strictly forbids the mention of a vendor
>> > in Apache code for any reason
>>
>> Let's imagine that the Apache Spark project starts to support the following
>> existing vendor `spark.databricks.*` configs to help Spark users migrate
>> (or offload) from the Databricks service to an open source Spark cluster
>> easily.
>>
>> - spark.databricks.cluster.profile
>> - spark.databricks.io.cache.enabled
>> - spark.databricks.delta.optimizeWrite.enabled
>> - spark.databricks.passthrough.enabled
>>
>> Some users or developers may claim that there is a clear and huge benefit.
>> I must agree with them, because that is also true.
>>
>> However, is this the way the Apache Spark project aims to go? I cannot
>> agree with that.
>>
>> It is very bad for the Apache Spark distribution to support the
>> `spark.databricks.*` namespace, as Spark 3.5.4 does, because it misleads
>> Apache Spark users by diluting the boundary between the Apache Spark
>> distribution (and brand) and commercial vendor products (and brands).
>> Note that Apache Spark 3.5.5 and all future 3.5.x releases also support
>> `spark.databricks.*` until April 2026 (the end-of-life), because a
>> deprecation is neither a deletion nor a ban.
>>
>> The incident in 3.5.4 is something that should never have happened. It
>> has already caused much confusion and will sadly cause more.
>>
>> The confusion is contagious, not only for the distribution but also for
>> the source code. I guess:
>> - The original Databricks contributor was perhaps confused about what he
>>   was contributing.
>> - The Apache Spark committer (Jungtaek) overlooked what we should not
>>   have approved, because the code resembles his internal company repo.
>> - The downstream Apache Spark fork repositories consume the
>>   `spark.databricks.*` namespace as Apache Spark's namespace.
>>
>> For me, it is even more misleading to dilute the boundary between Apache
>> Spark code and commercial vendor code.
>>
>> I have been working on this issue and consider this vote the last piece
>> of the overall handling of the `spark.databricks.*` incident, because I
>> believe we are establishing a new rule for the Apache Spark community.
>> This will serve as a precedent for handling similar incidents in the
>> future.
>>
>> Please let me re-summarize the steps I have taken with the community:
>>
>> 1. Helping rename the conf via SPARK-51172 (by approving it)
>>
>> 2. Banning `spark.databricks.*` via SPARK-51173 (by adding a `configName`
>>    Scalastyle rule; see the sketch right after this list)
>>
>> 3. Leading the discussion thread
>>    "Deprecating and banning `spark.databricks.*` config from Apache Spark repository"
>>    https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd
>>
>> 4. Reaching the agreement to release Spark 3.5.5 early
>>    [VOTE] Release Apache Spark 3.5.5 deprecating `spark.databricks.*` configuration
>>    https://lists.apache.org/thread/6nn76olr65b8zfgzdcbtr9f6o98451o5
>>
>> 5. Releasing 3.5.5 as a release manager to provide a candidate migration path
>>
>> 6. Proposing for 3.5.4 users to use 3.5.5 as the migration path
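>>
>> For concreteness, the `configName` rule in Step 2 boils down to rejecting
>> any source line that hard-codes the vendor namespace. Here is a rough,
>> plain-Scala sketch of the idea only; the actual rule added by SPARK-51173
>> is a regex entry in scalastyle-config.xml, not this code:
>>
>>   object ConfigNameCheckSketch {
>>     // The namespace the style rule forbids in Apache Spark sources.
>>     private val Banned = raw"spark\.databricks\.".r
>>
>>     // Return the 1-based numbers of the lines that violate the rule.
>>     def violations(source: String): Seq[Int] =
>>       source.linesIterator.zipWithIndex.collect {
>>         case (line, i) if Banned.findFirstIn(line).isDefined => i + 1
>>       }.toSeq
>>   }
>>
>> A build that wires such a check into CI fails fast whenever the banned
>> namespace reappears, which is why I call this the automatic way.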
>>
>> I proposed documenting this in the migration guide of Spark 4.0 (Step 6)
>> because that is the only way to handle this incident without using the
>> specific vendor config name again in the Spark code of the `master` and
>> `branch-4.0` branches. As you can read in Step 2 above, I prefer the
>> automatic way. The documentation-only solution was never my personal
>> preference. It was the lesser of two evils.
>>
>> Let me reiterate this. Although we succeeded in deprecating the
>> configuration early, the contaminated release branch `branch-3.5` and its
>> releases still honor the configuration in Spark jobs until April 2026
>> (the end-of-life). It is a long-standing, live incident which is still
>> happening now.
>>
>> For the vote, "Retain ... in Spark 4.0.x", I cast -1 because it aims to
>> introduce the vendor configuration name (string) back into the Apache
>> Spark 4 code. It means another contaminated branch, `branch-4.0`, will
>> blur the boundary.
>>
>> On top of that, the Databricks Apache Spark committer (Jungtaek), who
>> caused this incident by merging the `spark.databricks.*` code, set a trap
>> in this vote by writing the following when he initiated it.
>>
>> > if someone supports including migration logic to be longer than Spark
>> 4.0.x, please cast +1 here and leave the desired last minor version of
>> Spark to retain this migration logic.
>>
>> At the same time, he cast +1 with the following.
>>
>> > Starting from my +1 (non-binding).
>> > In addition, I propose to retain migration logic till Spark 4.1.x and
>> remove it in Spark 4.2.0.
>>
>> In the open source community, he is playing his own card trick by
>> flipping the vote title under everyone's nose like a magic trick.
>>
>> > [VOTE] Retain migration logic of incorrect `spark.databricks.*` config
>> in Spark 4.0.x
>> > [VOTE] Retain migration logic of incorrect `spark.databricks.*` config
>> in Spark 4.0.x/4.1.x
>>
>> In other words, Jungtaek is trying to extend this terrible and misleading
>> situation to the end of life of Spark 4.1.0 (Spring 2027) for now. I
>> expect he will extend it again, ignoring the removal in Spark 4.2+, with
>> the same reasons as the following.
>> - We usually don't introduce breaking behavior within the same major version.
>> - The maintenance cost is near zero.
>> In that case, it would become permanent under Spark 4 (~ 2030?).
>>
>> Of course, someone might say that it is better than the `branch-3.5`
>> situation because the migration code is read-only support. However, it is
>> still in the same category, misleading the community into the confusion
>> that Apache Spark supports `spark.databricks.*` configurations.
>>
>> The vote was articulated to cause a longer and bigger side effect,
>> because `branch-4.0` and `branch-4.1` have a longer lifetime and more
>> releases in total. To prevent the outbreak of contagious
>> `spark.databricks.*` situations, we should stop now and protect
>> `branch-4.0`. The side effects and implications are huge.
>>
>> Apache Spark 4.0.0 is the only version at which we can stop this from
>> spreading, so documentation-only is the only feasible way to choose.
>>
>> So, -1 (= The technical justification for the veto is valid)
>>
>> Sincerely,
>> Dongjoon.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>