Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-11 Thread Jungtaek Lim
Thanks for the input. > From a quick glance it seems like the incorrect config would just be ignored from the checkpoint, and the new config would just be applied with the default value going forward. That's not how it works. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/or

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-11 Thread Jungtaek Lim
That said, if you guys understand the proposal better and have a preference on one side, could you please participate in the VOTE thread? https://lists.apache.org/thread/nm3p1zjcybdl0p0mc56t2rl92hb9837n Specifically on this topic, I do think the input from users is very important, especially if yo

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Andrew Melo
Hi Jungtaek, I've read the discussion, which is why I replied with my questions (which you neglected to answer). Your deflection and lack of response to direct questions should be (IMO) disqualifying. So, again: To put it into less complicated words - presumably the people using the databricks.*

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Nicholas Chammas
> On Mar 10, 2025, at 10:14 PM, Andrew Melo wrote: >> >> This config was released to "Apache" Spark 3.5.4, so this is NO LONGER just >> a problem with vendor distribution. The breakage will happen even if someone >> does not even know about Databricks Runtime at all and keeps using Apache >>

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Jungtaek Lim
Thanks for looking into the issue in depth. What you described is right. I also understand the concern about why we keep the buggy behavior, but the QO issue is quite complicated and the most concerning part is that it's "selective". So if the query runs with QO's decision in "one way" in its lifecycle,

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Andrew Melo
Hi Jungtaek below On Mon, Mar 10, 2025 at 9:02 PM Jungtaek Lim wrote: > > Replied inline > > On Tue, Mar 11, 2025 at 10:39 AM Andrew Melo wrote: >> >> Hi Jungtaek, >> >> I've read the discussion, which is why I replied with my questions >> (which you neglected to answer). Your deflection and la

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Adam Binford
I was very confused about this as well but I think I understand it more after reading through the PRs. Jungtaek let me know if this is correct, maybe it will help others understand. There was a bug where streaming queries could prune parts of the query that might have side effects, like stateful q

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Jungtaek Lim
Replied inline On Tue, Mar 11, 2025 at 10:39 AM Andrew Melo wrote: > Hi Jungtaek, > > I've read the discussion, which is why I replied with my questions > (which you neglected to answer). Your deflection and lack of response > to direct questions should be (IMO) disqualifying. So, again: > > To

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Jungtaek Lim
Please read through the explanation of how this impacts the OSS users in the other branch of this discussion. This happened in "Apache" Spark 3.5.4, and the migration logic has nothing to do with the vendor. This is primarily to not break users in "Apache" Spark 3.5.4 who are willing to upgrade dir

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Andrew Melo
Hello Jungtaek, I'm not implying that this improves the vendor's life. I'm just not understanding the issue -- the downstream people started a stream with a config option that the upstream people don't want to carry. If the affected users are using the downstream fork (which is how they got the opt

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Jungtaek Lim
One thing I can correct immediately is, downstream does not have any impact at all from this. I believe I clarified that the config will not be modified by anyone, so downstream there is nothing to change. The problem is particular to OSS; downstream does not have any issue with this leak at all. (

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Andrew Melo
Hello all As an outsider, I don't fully understand this discussion. This particular configuration option "leaked" into the open-source Spark distribution, and now there is a lot of discussion about how to mitigate existing workloads. But: presumably the people who are depending on this configurati

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Adam Binford
As someone who has a lot of streams that have been restarted with 3.5.4, I would prefer not to have to restart everything with 3.5.5 but it's definitely doable. But my question is what is the actual behavior if the migration logic was removed? From a quick glance it seems like the incorrect config

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-07 Thread Dongjoon Hyun
Thank you for leading the discussion, Jungtaek. +1 for voting because we couldn't build a unanimous consensus on this specific topic. Thanks, Dongjoon. On 2025/03/07 09:15:39 Jungtaek Lim wrote: > I'll need to start VOTE to move this forward. >

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-07 Thread Sean Owen
What is the problem with the existence of the migration logic? I understand not keeping the misnamed config. But the migration logic does no harm other than taking up a couple lines in the code, no? Unless someone offers any reason this is an issue... what are we even talking about. Is the idea th

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-07 Thread Jungtaek Lim
> Is the idea that the existence of the string 'databricks' in the code is a problem? it is not. This is the point where we have arguments. There is disagreement that this is not just a string which I don't agree with, hence this discussion thread. This is also why I want to see what is "evidence"

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-07 Thread Jungtaek Lim
Sean, it's about how long we keep the vendor name in the codebase (since the migration logic has the problematic config name as a string) for users. We agreed in a previous discussion thread that "this is generally bad", so we fixed the incorrect config name immediately in 3.5. Now it's just a str

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-07 Thread Sean Owen
I don't understand the problem with keeping migration logic in for a long time, just in case. Who cares, it's some bit of check buried somewhere in the streaming code, much like deprecation warnings. There is not somehow an ASF policy compelling the removal of such logic; you are not _required_ to

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-06 Thread Jungtaek Lim
I have to cast -1 (despite non-binding) for every single RC for Spark 4.0.0 till this is settled, since I don't agree with the current status (Dongjoon's proposal is as-is). On the other hand, I want to unblock this and stop bugging the RC phase. Again, I could be only persuaded if this is mandato

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-05 Thread Jungtaek Lim
I think it is how to handle the deprecation and removal. If we leave the migration path for Spark 4.1.x, it will take more than "1 year of upgrade path" to be successful. From our release cadence, Spark 4.2.0 would probably be released March next year or later. And Spark 3.5.4 was released in Dece

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-05 Thread Dongjoon Hyun
Let me reformulate your suggestions and my interpretation. Option 1 "Adding back `spark.databricks.*` in Spark codebase and keep forever" If we follow the proposed logic and reasoning, it means there is no safe version to remove that configuration because Apache Spark 3.5.4 users can jump to any

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-04 Thread Dongjoon Hyun
Technically, there is no agreement here. In other words, we have the same situation with the initial discussion thread where we couldn't build a community consensus on this. > I will consider this as "lazy consensus" if there are no objections > for 3 days from initiation of the thread. If you ne

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-04 Thread Jungtaek Lim
Let's not start a VOTE right now, but let me lay out the options and the pros/cons of each, so that people can choose one over another. Option 1 (Current proposal): retain migration logic for Spark 4.0 (and maybe more minor versions, up to decision) which contains the problematic config

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-04 Thread Wenchen Fan
Shall we open an official vote for it? We can put more details on it so that people can vote: 1. how does it break user workloads without this migration code? 2. what is the Apache policy for leaked vendor names in the codebase? I think this is not the only one, we also mentioned `com.databricks.sp

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-04 Thread Jungtaek Lim
One major question: How do you believe we can enforce an upgrade path on users? I have seen a bunch of cases where users upgrade 2-3 minor versions at once. Do you really believe we can just break their query? What's the data backing up your claim? I think we agree to disagree. I really don't

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-04 Thread Jungtaek Lim
Bumping on this. Again, this is a blocker for Spark 4.0.0. I will consider this as "lazy consensus" if there are no objections for 3 days from initiation of the thread. On Tue, Mar 4, 2025 at 2:15 PM Jungtaek Lim wrote: > Hi dev, > > This is a spin-up of the original thread "Deprecating and bann

[DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-03 Thread Jungtaek Lim
Hi dev, This is a spin-up of the original thread "Deprecating and banning `spark.databricks.*` config from Apache Spark repository". (link ) From the original thread, we decided to deprecate the config in Spark 3.5.5 and remove th