Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

Wenchen Fan Tue, 04 Mar 2025 22:53:07 -0800

Shall we open an official vote for it? We can put more details on it so
that people can vote:
1. how does it break user workloads without this migration code?
2. what is the Apache policy for leaked vendor names in the codebase? I
think this is not the only one; we also mention
`com.databricks.spark.csv` in
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L621C8-L621C32


On Wed, Mar 5, 2025 at 2:40 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> One major question: How do you believe we can enforce an upgrade path
> on users? I have seen a bunch of cases where users upgrade 2-3 minor
> versions at once. Do you really believe we can just break their query?
> What's the data backing up your claim?
>
> I think we agree to disagree. I really don't want "users" to get into
> such situations just because of us. Regardless of who made the mistake,
> the question is what the proper mitigation is, and I do not believe
> forcing users to upgrade to Spark 3.5.8+ before upgrading to Spark 4.0 is a
> valid approach.
>
> If I could vote for your alternative option, I'm -1 for it.
>
> On Wed, Mar 5, 2025 at 3:29 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Technically, there is no agreement here. In other words, we have the
>> same situation as in the initial discussion thread, where we couldn't build a
>> community consensus on this.
>>
>> > I will consider this as "lazy consensus" if there are no objections
>> > for 3 days from initiation of the thread.
>>
>> If you need an explicit veto, here is mine, -1, because I don't think
>> that's just a string.
>>
>> > the problematic config is just a "string",
>>
>> To be clear, as I proposed both in the PR comments and in the initial
>> discussion thread, I believe we had better keep `master` and `branch-4.0`
>> AS-IS and recommend upgrading to the latest version of Apache Spark 3.5.x
>> before upgrading to Spark 4.
>>
>> Sincerely,
>> Dongjoon.
>>
>>
>> On Tue, Mar 4, 2025 at 8:37 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> Bumping on this. Again, this is a blocker for Spark 4.0.0. I will
>>> consider this as "lazy consensus" if there are no objections for 3 days
>>> from initiation of the thread.
>>>
>>> On Tue, Mar 4, 2025 at 2:15 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi dev,
>>>>
>>>> This is a spin-off of the original thread "Deprecating and banning
>>>> `spark.databricks.*` config from Apache Spark repository". (link
>>>> <https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd>)
>>>>
>>>> From the original thread, we decided to deprecate the config in Spark
>>>> 3.5.5 and remove the config in Spark 4.0.0. That thread left one thing
>>>> undecided: the smooth migration logic.
>>>>
>>>> We "persist" the config into the offset log for streaming queries, since the
>>>> value of the config must be consistent during the lifecycle of the query.
>>>> This means the problematic config is already persisted for any streaming
>>>> query that has ever run with Spark 3.5.4.
>>>>
>>>> For the migration logic, we re-assign the value of the problematic
>>>> config to the new config. This happens when the query is restarted, and it
>>>> will be reflected in the offset log for "newer batches", so after a couple of
>>>> new microbatches the migration logic isn't needed. This migration logic is
>>>> shipped in Spark 3.5.5, so once the query has run with Spark 3.5.5 for a
>>>> couple of microbatches, the issue will be mitigated.
>>>>
>>>> But I would say that there will always be cases where users just bump
>>>> the minor/major version without following all the bugfix versions. I think
>>>> it is still dangerous to remove the migration logic in Spark 4.0.0 (and
>>>> probably Spark 4.1.0, depending on the discussion). In the migration
>>>> logic, the problematic config is just a "string", and users wouldn't be
>>>> able to set the value with the problematic config name. We don't document
>>>> this, as it'll be done automatically.
>>>>
>>>> That said, I'd propose to keep the migration logic for the Spark 4.0 version
>>>> line (at minimum; 4.1 is debatable). This will give users a safer and less
>>>> burdensome migration path, while retaining only a problematic "string" (again,
>>>> not a config).
>>>>
>>>> I'd love to hear the community's voice on this. I'd like to remind you,
>>>> this is a blocker for Spark 4.0.0.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>

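For readers following the thread, the migration step described in the
original message (re-assigning the value persisted under the problematic
key to the new key when a query restarts) might look roughly like the
sketch below. This is an illustration only, not the code shipped in Spark
3.5.5: the two key names and the function `migrate_persisted_conf` are
hypothetical, and the offset-log config is modeled as a plain dictionary.

```python
from typing import Dict

# Hypothetical key names, for illustration only; the thread does not name
# the actual config.
LEGACY_KEY = "spark.databricks.example.flag"
NEW_KEY = "spark.example.flag"


def migrate_persisted_conf(persisted: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the persisted config with the legacy key migrated.

    The value stored under the legacy key is re-assigned to the new key,
    unless the new key is already present. The legacy key is dropped, so
    configs written for later batches carry only the new key.
    """
    migrated = dict(persisted)  # do not mutate the caller's mapping
    legacy_value = migrated.pop(LEGACY_KEY, None)
    if legacy_value is not None and NEW_KEY not in migrated:
        migrated[NEW_KEY] = legacy_value
    return migrated
```

Under these assumptions, a config persisted by 3.5.4 that holds only the
legacy key comes back holding only the new key with the same value, and a
config that already holds the new key is left unchanged. That would match
the remark in the thread that the migration stops being needed once newer
batches have been written.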