Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-19 Thread Jungtaek Lim
It looks like we treat having the vendor's name in the config as seriously as a security concern, even though we would still handle it gracefully for a single version line. If that's the case, I'm OK with removing the config in 4.0.0 and taking the path of the migration doc, while deprecati

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Mark Hamstra
This doesn't really have anything to do with a broader approach to breaking changes. Removing the mistake in 4.0.0 does not change our striving to avoid breaking APIs or silently changing behavior -- striving is not a guarantee. And the addition of check-in tooling should prevent the issue from rec
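
The thread does not describe the check-in tooling itself. As a minimal sketch of what such a check could look like, assuming it only needs to flag string occurrences of the `spark.databricks.` prefix in source lines, the hypothetical helper below (`find_banned_configs` is not a name from the Spark repository) reports each offending line number and key:

```python
import re
from typing import Iterable, List, Tuple

# Banned vendor namespace, taken from the thread subject.
# The character class is an assumption about what a config key may contain.
BANNED_KEY = re.compile(r"spark\.databricks\.[A-Za-z0-9_.]*")


def find_banned_configs(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, matched key) for every banned config key."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(lines, start=1):
        for match in BANNED_KEY.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Hypothetical source lines; the key names are illustrative only.
    sample = [
        'val ok = "spark.sql.shuffle.partitions"',
        'val bad = "spark.databricks.example.flag"',
    ]
    for lineno, key in find_banned_configs(sample):
        print(f"line {lineno}: banned config key {key}")
```

A pre-merge check could run a scan like this over changed files and fail when the result is non-empty.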

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Wenchen Fan
Hi Mark, If I understand correctly, we are introducing a breaking change in 4.0 by removing configs because it is necessary. I’m not suggesting that we are violating the rule, just ensuring that there is consensus on this being a necessary breaking change, which it seems there is. And yes, this is

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Wenchen Fan
Hi Dongjoon, If this is a policy issue that necessitates a breaking change, then sure, let’s proceed. I don’t have a strong opinion on this specific case, but I’m more concerned with the broader approach to breaking changes. I’m referencing this statement from the Spark Version Policy

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Dongjoon Hyun
Thank you for your opinions, Bjorn, Jungtaek, Wenchen, Holden, Mich, and Mark. At least, I believe we agree that we should provide a way to mitigate the Apache Spark 3.5.4 issue ASAP. To make a real community action in order to prevent the further spread of `spark.databricks.*` configuration by Spar

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Dongjoon Hyun
I have different perspectives from Wenchen's opinion in three ways.
> I’d like to emphasize that a major version release is not a justification
> for unnecessary breaking changes.
> ...the period between 3.5.5 and 4.0.0 likely isn’t long enough.
First, it's an inevitably necessary change to prot

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Mark Hamstra
The issue is not how many lines of code it is, but rather how serious of an issue it is to have the databricks namespace in Apache code. It's not a large functional issue, but that doesn't mean that it is only a minor issue, nor do I think that I would characterize the removal of this clear error a

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Wenchen Fan
Hi all, I’d like to emphasize that a major version release is not a justification for unnecessary breaking changes. If we are confident that no one is using this configuration, we should clean it up in 3.5.5 as well. However, if there’s a possibility that users are already relying on it, then lega

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Mich Talebzadeh
Depends how you want to play this. As usual, a cost/benefit analysis will be useful.
*Immediate Removal in Spark 3.5.5*:
pros: Quickly removes the problematic configuration, reducing technical debt and potential issues.
cons: Users upgrading directly from earlier versions to Spark 3.5.5 or later wil

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Jungtaek Lim
Though if we are OK with disturbing users to read the migration guide to figure out the change for the case of direct upgrade to Spark 4.0.0+, I agree this is also one of the valid ways.
On Wed, Feb 19, 2025 at 9:20 AM Jungtaek Lim wrote:
> The point is, why can't we remove it from Spark 3.5.5 a

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Jungtaek Lim
The point is, why can't we remove it from Spark 3.5.5 as well if we are planning to "remove" (not deprecate) at the very next minor release? The logic of migration just works without having the incorrect config key to be indicated with SQL config key. That said, the point we debate here is only v

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Holden Karau
I think that removing in 4 sounds reasonable to me as well. It’s important to create a sense of fairness among vendors.

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Dongjoon Hyun
I don't think there is a reason to keep it at 4.0.0 (and forever?) if we release Spark 3.5.5 with the proper deprecation. This is a big difference, Wenchen. And, the difference is the main reason why I initiated this thread to suggest removing 'spark.databricks.*' completely from Apache Spark 4

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-17 Thread Wenchen Fan
It’s unfortunate that we missed identifying these issues during the code review. However, since they have already been released, I believe deprecating them is a better approach than removing them, as the latter would introduce a breaking change. Regarding Jungtaek’s PR

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-17 Thread Jungtaek Lim
I think I can add some color to minimize the concern. The problematic config we added is arguably not user facing. I'd argue that typical users wouldn't even understand what the flag is doing. The config was added because Structured Streaming has been leveraging SQL config to "do the magic" on having two

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-17 Thread Dongjoon Hyun
For Spark 3.5.5, did you see this, which is the best the community can offer? https://github.com/apache/spark/pull/49985 [SPARK-51187][SQL][SS][3.5] Implement the graceful deprecation of incorrect config introduced in SPARK-49699 Dongjoon On Mon, Feb 17, 2025 at 14:38 Bjørn Jørgensen wrote: > > Hav
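
The preview above does not show how the linked PR implements the graceful deprecation. One common pattern, sketched here purely as an illustration and not as what the PR does, is to accept the misnamed key as an alias for a correctly named one and let an explicitly set new key win. Both key names and the `resolve_conf` helper are hypothetical:

```python
import warnings
from typing import Dict

# Hypothetical key names for illustration only; they are not the
# actual keys discussed in the thread.
DEPRECATED_ALIASES: Dict[str, str] = {
    "spark.databricks.example.flag": "spark.sql.example.flag",
}


def resolve_conf(conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of conf with deprecated keys mapped to their replacements.

    An explicitly set new key takes precedence over the deprecated alias.
    """
    # Keep every key that is not a deprecated alias.
    resolved = {k: v for k, v in conf.items() if k not in DEPRECATED_ALIASES}
    for old_key, new_key in DEPRECATED_ALIASES.items():
        if old_key in conf:
            warnings.warn(
                f"{old_key} is deprecated; use {new_key} instead",
                DeprecationWarning,
            )
            # Only fill in the new key when the user has not set it.
            resolved.setdefault(new_key, conf[old_key])
    return resolved


if __name__ == "__main__":
    print(resolve_conf({"spark.databricks.example.flag": "false"}))
```

With this shape, removing the alias later means deleting one dictionary entry, which is the kind of step a migration guide can document.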

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-17 Thread Bjørn Jørgensen
Having breaking changes in a minor release does not seem good. As I'm reading this, "*This could break the query if the rule impacts the query, because the effectiveness of the fix is flipped.*" https://github.com/apache/spark/pull/49897#issuecomment-2652567140 What if we have this https://github.com/

Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-17 Thread Dongjoon Hyun
Hi, All. I'd like to highlight this discussion because this is more important and tricky in a way. As already mentioned in the mailing list and PRs, there was an obvious mistake which missed an improper configuration name, `spark.databricks.*`. https://github.com/apache/spark/blob/a6f220d951742f