Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+
Let me reformulate your suggestions and my interpretation.

Option 1: "Adding back `spark.databricks.*` in the Spark codebase and keeping it forever"

If we follow the proposed logic and reasoning, it means there is no safe version in which to remove that configuration, because Apache Spark 3.5.4 users can technically jump to any future release, such as Spark 4.1.0, 4.2.0, or 5.0.0. In other words, we can never remove that logic. That is the reason we have not been able to reach an agreement so far.

Option 2 is simply adding a sentence (or a more accurate one) about Spark 3.5.4 to the Spark 4.0.0 guideline, because all other Spark versions (except 3.5.4) are not contaminated by the `spark.databricks.*` conf:

"For Spark 3.5.4 streaming jobs, if you want to migrate the existing running jobs, you need to upgrade them to Spark 3.5.5+ before upgrading to Spark 4.0."

Dongjoon.

On Tue, Mar 4, 2025 at 11:11 PM Jungtaek Lim wrote:

> Let's not start a VOTE right now, but let me make the options and their pros and cons clear, so that people can choose one over the other.
>
> Option 1 (current proposal): retain the migration logic for Spark 4.0 (and possibly more minor versions, up for decision), which contains the problematic config as a "string".
>
> Pros: We avoid breaking users' queries on any upgrade path, as long as we retain the migration logic. For example, we can support upgrading a streaming query that ever ran on Spark 3.5.4 to Spark 4.0.x as long as we retain the migration logic in Spark 4.0, and to Spark 4.1.x, 4.2.x, etc. as long as we retain the migration path longer.
> Cons: We retain the concerning config name in the codebase, though it is only a string and users can never set it.
>
> Option 2 (Dongjoon's proposal): do not bring the migration logic into Spark 4.0 and require users to run the existing streaming query on Spark 3.5.5+ before upgrading to Spark 4.0.0+.
>
> Pros: We stop retaining the concerning config name in the codebase.
> Cons: Upgrading directly from Spark 3.5.4 to Spark 4.0+ misses the critical QO fix, which can lead to a "broken" checkpoint. If the checkpoint is broken, there is no way to restore it, and users have to restart the query from scratch. Since the target workload is stateful, in the worst case the query has to start over from the earliest data.
>
> I would only agree about the severity if an ASF project had previously had a case of a vendor name in its codebase and it was decided to pay whatever cost to fix it. I'm happy to be corrected if there is an ASF document explicitly describing such a case and the action item.
>
> On Wed, Mar 5, 2025 at 3:51 PM Wenchen Fan wrote:
>
>> Shall we open an official vote for it? We can put more details on it so that people can vote:
>> 1. How does it break user workloads without this migration code?
>> 2. What is the Apache policy for leaked vendor names in the codebase? I think this is not the only one; we also mention `com.databricks.spark.csv` in
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L621C8-L621C32
>>
>> On Wed, Mar 5, 2025 at 2:40 PM Jungtaek Lim wrote:
>>
>>> One major question: how do you believe we can enforce an upgrade path on users? I have seen a bunch of cases where users upgrade 2-3 minor versions at once. Do you really believe we can just break their queries? What data backs up your claim?
>>>
>>> I think we agree to disagree. I really don't want "users" to get into trouble just because of us. It's regardless of who made the mistake - it's about what the proper mitigation is, and I do not believe forcing users to upgrade to Spark 3.5.8+ before upgrading to Spark 4.0 is a valid approach.
>>>
>>> If I could vote on your alternative option, I'm -1 on it.
>>>
>>> On Wed, Mar 5, 2025 at 3:29 PM Dongjoon Hyun wrote:

Technically, there is no agreement here. In other words, we have the same situation as in the initial discussion thread, where we couldn't build a community consensus on this.

> I will consider this as "lazy consensus" if there are no objections for 3 days from initiation of the thread.

If you need an explicit veto, here is mine, -1, because I don't think that's just a string.

> the problematic config is just a "string",

To be clear, as I proposed both in the PR comments and the initial discussion thread, I believe we had better keep the AS-IS `master` and `branch-4.0` and recommend upgrading to the latest version of Apache Spark 3.5.x first before upgrading to Spark 4.

Sincerely,
Dongjoon.

On Tue, Mar 4, 2025 at 8:37 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Bumping on this. Again, this is a blocker for Spark 4.0.0. I will consider this as "lazy consensus" if there are no objections for 3 days from initiation of the thread.
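To make the trade-off in Option 1 concrete, below is a minimal sketch of what the "migration logic" amounts to: a key rename applied when the conf map is read back from a streaming checkpoint's offset log. This is illustrative only and is not the actual Spark implementation; the object name and both config names are hypothetical placeholders.

    // Hypothetical sketch (not Spark code): remap a vendor-prefixed conf key that a
    // Spark 3.5.4 run may have persisted into a checkpoint's offset log so that later
    // versions resolve it under its Apache Spark name.
    object OffsetSeqConfMigration {
      // Leaked 3.5.4 key -> Apache-side key; both names are placeholders.
      private val leakedKeyMappings: Map[String, String] = Map(
        "spark.databricks.sql.example.streamingFix.enabled" ->
          "spark.sql.example.streamingFix.enabled"
      )

      // Rewrites any leaked vendor key to its Apache-side name; other entries pass through.
      def migrate(confFromCheckpoint: Map[String, String]): Map[String, String] =
        confFromCheckpoint.map { case (key, value) =>
          leakedKeyMappings.getOrElse(key, key) -> value
        }
    }

The contentious point is only that the vendor-prefixed string on the left-hand side of the mapping has to stay in the codebase for as long as such a remapping is shipped.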
Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+
I think this is a question of how to handle deprecation and removal. If we keep the migration path through Spark 4.1.x, users get more than a one-year upgrade window: from our release cadence, Spark 4.2.0 would probably be released in March next year or later, and Spark 3.5.4 was released in December last year. Keeping the path only for Spark 4.0.x is not very long, but it still provides a 6+ month upgrade window.

I'm not saying we should keep it forever. I'm saying we should try to reduce the probability of breakage, the same way projects handle deprecation and removal while trying to minimize the impact. I understand you expect the number of affected users to be small, but that doesn't mean we are free to do nothing and leave them hitting bugs and dissatisfied with the project.

I also see this being compared to a "security fix" when we talk about severity, but a security fix does not restrict the upgrade path, so what we are about to do is much worse than that. I'm trying to make it a lot less bad. I'm doing my best to care about users. Upgrading is not just "one click", even for bugfix versions.

On Thu, Mar 6, 2025 at 1:56 AM Dongjoon Hyun wrote:

> Let me reformulate your suggestions and my interpretation.
>
> Option 1: "Adding back `spark.databricks.*` in the Spark codebase and keeping it forever"
>
> If we follow the proposed logic and reasoning, it means there is no safe version in which to remove that configuration, because Apache Spark 3.5.4 users can technically jump to any future release, such as Spark 4.1.0, 4.2.0, or 5.0.0. In other words, we can never remove that logic.
>
> That is the reason we have not been able to reach an agreement so far.
>
> Option 2 is simply adding a sentence (or a more accurate one) about Spark 3.5.4 to the Spark 4.0.0 guideline, because all other Spark versions (except 3.5.4) are not contaminated by the `spark.databricks.*` conf:
>
> "For Spark 3.5.4 streaming jobs, if you want to migrate the existing running jobs, you need to upgrade them to Spark 3.5.5+ before upgrading to Spark 4.0."
>
> Dongjoon.
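Since both options hinge on whether a given query ever ran on Spark 3.5.4 (the only release that wrote `spark.databricks.*` keys into the offset log's conf map), an operator can check a checkpoint directly before choosing an upgrade path. The helper below is a hypothetical, operator-side sketch and is not part of Spark; it assumes the checkpoint sits on a local filesystem with the usual `<checkpoint>/offsets/<batchId>` layout.

    import java.io.File
    import scala.io.Source
    import scala.util.Using

    // Hypothetical helper (not part of Spark): report whether any batch file in a
    // Structured Streaming checkpoint's offset log mentions a "spark.databricks."
    // conf key, which would indicate the query ran on Spark 3.5.4 at some point.
    object CheckpointConfScan {
      def wroteVendorConf(checkpointDir: String): Boolean = {
        val offsetsDir = new File(checkpointDir, "offsets")
        val batchFiles = Option(offsetsDir.listFiles()).getOrElse(Array.empty[File])
        batchFiles.filter(_.isFile).exists { file =>
          Using.resource(Source.fromFile(file))(_.mkString.contains("spark.databricks."))
        }
      }

      def main(args: Array[String]): Unit = {
        require(args.length == 1, "usage: CheckpointConfScan <local checkpoint dir>")
        println(s"leaked vendor conf present: ${wroteVendorConf(args(0))}")
      }
    }

A checkpoint where this reports false would generally not need the 3.5.5+ intermediate step described in Option 2; the disagreement above only matters for checkpoints that do carry the leaked key.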
Re: [VOTE] Release Spark 4.0.0 (RC2)
Here is one more problem I found during RC2 verification:

https://github.com/apache/spark/pull/50173

This one is just a test issue.

Chris Nauroth

On Tue, Mar 4, 2025 at 2:55 PM Jules Damji wrote:

> -1 (non-binding)
>
> I ran into a number of installation and launching problems. Maybe it's my environment, even though I removed any old binaries and packages.
>
> 1. Pip installing pyspark 4.0.0 and pyspark-connect 4.0.0 from the tarball worked, but launching pyspark results in:
>
> 25/03/04 14:00:26 ERROR SparkContext: Error initializing SparkContext.
> java.lang.ClassNotFoundException: org.apache.spark.sql.connect.SparkConnectPlugin
>
> 2. Similarly, installing the tarballs of either distribution and launching spark-shell goes into a loop and is terminated by the shutdown hook.
>
> Thank you, Wenchen, for leading these onerous release manager efforts; over time we should get to a point where installing and launching work seamlessly.
>
> Keep up the good work & tireless effort for the Spark community!
>
> cheers
> Jules
>
> WARNING: Using incubator modules: jdk.incubator.vector
> 25/03/04 14:49:35 INFO BaseAllocator: Debug mode disabled. Enable with the VM option -Darrow.memory.debug.allocator=true.
> 25/03/04 14:49:35 INFO DefaultAllocationManagerOption: allocation manager type not specified, using netty as the default type
> 25/03/04 14:49:35 INFO CheckAllocator: Using DefaultAllocationManager at memory/netty/DefaultAllocationManagerFactory.class
> Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
> 25/03/04 14:49:35 WARN GrpcRetryHandler: Non-Fatal error during RPC execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception, retrying (wait=50 ms, currentRetryNum=1, policy=DefaultPolicy).
> 25/03/04 14:49:35 WARN GrpcRetryHandler: Non-Fatal error during RPC execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception, retrying (wait=200 ms, currentRetryNum=2, policy=DefaultPolicy).
> 25/03/04 14:49:35 WARN GrpcRetryHandler: Non-Fatal error during RPC execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception, retrying (wait=800 ms, currentRetryNum=3, policy=DefaultPolicy).
> 25/03/04 14:49:36 WARN GrpcRetryHandler: Non-Fatal error during RPC execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception, retrying (wait=3275 ms, currentRetryNum=4, policy=DefaultPolicy).
> 25/03/04 14:49:39 WARN GrpcRetryHandler: Non-Fatal error during RPC execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception, retrying (wait=12995 ms, currentRetryNum=5, policy=DefaultPolicy).
> ^C25/03/04 14:49:40 INFO ShutdownHookManager: Shutdown hook called
>
> On Mar 4, 2025, at 2:24 PM, Chris Nauroth wrote:
>
> -1 (non-binding)
>
> I think I found some missing license information in the binary distribution. We may want to include this in the next RC:
>
> https://github.com/apache/spark/pull/50158
>
> Thank you for putting together this RC, Wenchen.
>
> Chris Nauroth
>
> On Mon, Mar 3, 2025 at 6:10 AM Wenchen Fan wrote:
>
>> Thanks for bringing up these blockers! I know RC2 isn't fully ready yet, but with over 70 commits since RC1, it's time to have a new RC so people can start testing the latest changes. Please continue testing and keep the feedback coming!
>>
>> On Mon, Mar 3, 2025 at 6:06 PM beliefer wrote:
>>
>>> -1
>>> https://github.com/apache/spark/pull/50112 should be merged before release.
>>>
>>> At 2025-03-01 15:25:06, "Wenchen Fan" wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version 4.0.0.
>>>
>>> The vote is open until March 5 (PST) and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 4.0.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v4.0.0-rc2 (commit 85188c07519ea809012db24421714bb75b45ab1b):
>>> https://github.com/apache/spark/tree/v4.0.0-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1478/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc2-docs/
>>>
>>> The list of bug fixes going into 4.0.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>>
>>> This release is using the release script of the tag v4.0.0-rc2.
>>>
>>> FAQ
>>>
>>> =
>>> How can
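For testing the staged jars from an sbt project (complementing the pip-based checks reported above), a minimal build.sbt sketch pointing at the staging repository listed in the vote e-mail is shown below. The Scala version and the assumption that the staged artifacts use the plain 4.0.0 version string are assumptions, not taken from the vote e-mail.

    // Minimal build.sbt sketch for compiling a small test job against the staged
    // RC2 artifacts. The staging URL comes from the vote e-mail above; the Scala
    // version and the "4.0.0" version string for the staged jars are assumptions.
    scalaVersion := "2.13.15"

    resolvers += "Apache Spark 4.0.0 RC2 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1478/"

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0"

Running a trivial local job compiled against these jars tends to surface packaging and dependency issues early, alongside the signature and license checks already mentioned in the thread.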