Let's not start with a VOTE right now; instead, let me lay out the options
and the pros/cons of each, so that people can choose one over the other.

Option 1 (Current proposal): retain the migration logic for Spark 4.0 (and
possibly later minor versions, subject to decision), which contains the
problematic config name only as a "string".

Pros: We can avoid breaking users' queries on any upgrade path, as long
as we retain the migration logic. For example, we can support upgrading a
streaming query that ever ran on Spark 3.5.4 to Spark 4.0.x, and to Spark
4.1.x, Spark 4.2.x, etc., for as long as we decide to retain the migration
path.
Cons: We retain the concerned config name in the codebase, though it is
only a string and users can never set it.
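To make the "just a string" point concrete, here is a hypothetical Python
sketch of the remapping idea (the actual Spark implementation is in Scala,
and the key names and function name below are placeholders, not the real
config names):

```python
# Hypothetical sketch of the Option 1 migration logic: the deprecated key
# survives only as a literal string used to remap old offset-log entries.
# OLD_KEY and NEW_KEY are placeholders, not the actual Spark config names.
OLD_KEY = "spark.databricks.some.conf"  # placeholder for the deprecated key
NEW_KEY = "spark.sql.some.conf"         # placeholder for the replacement key

def migrate_offset_log_confs(confs: dict) -> dict:
    """Re-assign the value persisted under the old key to the new key.

    Once the first few microbatches rewrite the offset log with the new
    key, this becomes a no-op for that query.
    """
    migrated = dict(confs)
    if OLD_KEY in migrated and NEW_KEY not in migrated:
        migrated[NEW_KEY] = migrated.pop(OLD_KEY)
    return migrated

# Example: a checkpoint written by Spark 3.5.4 carries the old key.
print(migrate_offset_log_confs({OLD_KEY: "true"}))
# → {'spark.sql.some.conf': 'true'}
```

Note the old key never appears as a settable config here; it exists only
to recognize entries already persisted in old checkpoints.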

Option 2 (Dongjoon's proposal): do not ship the migration logic in Spark
4.0, and require users to run their existing streaming queries on Spark
3.5.5+ before upgrading to Spark 4.0.0+.

Pros: We stop retaining the concerned config name in the codebase.
Cons: Upgrading directly from Spark 3.5.4 to Spark 4.0+ will miss the
critical QO fix, which can lead to a "broken" checkpoint. If the checkpoint
is broken, there is no way to restore it, and users have to restart the
query from scratch. Since the target workload is stateful, in the worst
case the query has to reprocess from the earliest data.

I would only agree about the severity if an ASF project had a prior case of
a vendor name in the codebase where it was decided to pay whatever cost to
fix it. I'm happy to be corrected if there is an ASF document explicitly
describing such a case and the action item taken.

On Wed, Mar 5, 2025 at 3:51 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> Shall we open an official vote for it? We can put more details on it so
> that people can vote:
> 1. how does it break user workloads without this migration code?
> 2. what is the Apache policy for leaked vendor names in the codebase? I
> think this is not the only one; we also mention
> `com.databricks.spark.csv` in
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L621C8-L621C32
>
> On Wed, Mar 5, 2025 at 2:40 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> One major question: how do you expect to enforce an upgrade path on
>> users? I have seen plenty of cases where users upgrade 2-3 minor versions
>> at once. Do you really believe we can just break their queries? What data
>> backs up your claim?
>>
>> I think we agree to disagree. I really don't want "users" to get into
>> this situation just because of us. Regardless of who made the mistake,
>> it's about what the proper mitigation is, and I do not believe forcing
>> users to upgrade to Spark 3.5.8+ before upgrading to Spark 4.0 is a
>> valid approach.
>>
>> If I could vote for your alternative option, I'm -1 for it.
>>
>> On Wed, Mar 5, 2025 at 3:29 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Technically, there is no agreement here. In other words, we have the
>>> same situation with the initial discussion thread where we couldn't build a
>>> community consensus on this.
>>>
>>> > I will consider this as "lazy consensus" if there are no objections
>>> > for 3 days from initiation of the thread.
>>>
>>> If you need an explicit veto, here is mine, -1, because I don't think
>>> that's just a string.
>>>
>>> > the problematic config is just a "string",
>>>
>>> To be clear, as I proposed both in the PR comments and the initial
>>> discussion thread, I believe we had better keep `master` and
>>> `branch-4.0` as-is and recommend upgrading to the latest version of
>>> Apache Spark 3.5.x before upgrading to Spark 4.
>>>
>>> Sincerely,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Mar 4, 2025 at 8:37 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Bumping on this. Again, this is a blocker for Spark 4.0.0. I will
>>>> consider this as "lazy consensus" if there are no objections for 3 days
>>>> from initiation of the thread.
>>>>
>>>> On Tue, Mar 4, 2025 at 2:15 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Hi dev,
>>>>>
>>>>> This is a spin-off of the original thread "Deprecating and banning
>>>>> `spark.databricks.*` config from Apache Spark repository". (link
>>>>> <https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd>)
>>>>>
>>>>> In the original thread, we decided to deprecate the config in Spark
>>>>> 3.5.5 and remove it in Spark 4.0.0. That thread left one thing
>>>>> undecided: the smooth migration logic.
>>>>>
>>>>> We "persist" the config into the offset log for streaming queries,
>>>>> since the value of the config must stay consistent across the lifecycle
>>>>> of the query. This means the problematic config is already persisted
>>>>> for any streaming query that ever ran with Spark 3.5.4.
>>>>>
>>>>> For the migration logic, we re-assign the value of the problematic
>>>>> config to the new config. This happens when the query is restarted, and
>>>>> it is reflected into the offset log for newer batches, so after a
>>>>> couple of new microbatches the migration logic is no longer needed.
>>>>> This migration logic shipped in Spark 3.5.5, so once the query has run
>>>>> on Spark 3.5.5 for a couple of microbatches, the issue is mitigated.
>>>>>
>>>>> But I would say there will always be cases where users bump the
>>>>> minor/major version without going through every bugfix version. I think
>>>>> it is still dangerous to remove the migration logic in Spark 4.0.0 (and
>>>>> probably Spark 4.1.0, depending on the discussion). Within the
>>>>> migration logic, the problematic config is just a "string", and users
>>>>> are never able to set a value under the problematic config name. We
>>>>> don't document this, as it happens automatically.
>>>>>
>>>>> That said, I propose keeping the migration logic for the Spark 4.0
>>>>> version line (at minimum; whether to extend to 4.1 is debatable). This
>>>>> gives users a safer, less burdensome migration path at the cost of
>>>>> retaining only a problematic "string" (again, not a config).
>>>>>
>>>>> I'd love to hear the community's voice on this. As a reminder, this
>>>>> is a blocker for Spark 4.0.0.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>
