I think I can add some color to minimize the concern.

The problematic config we added is arguably not user facing. I'd argue
most users wouldn't even understand what the flag is doing. The config
was added because Structured Streaming has been leveraging a SQL config to
"do the magic" of having two different default values for a new query vs.
an old query (one whose checkpoint was created on a version where the fix
had not landed). This is purely for backward compatibility, not something
we want to give users flexibility over.
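
For illustration, here is a minimal Scala sketch of that mechanism under my
own assumptions (the object and method names are hypothetical and the
default shown is illustrative; this is not the actual SQLConf definition):
a new query gets the fixed behavior by default, while an old query honors
whatever value was recorded in its checkpoint.

// Hypothetical sketch: one flag, two effective defaults.
// A new query (no entry recorded in the offset log) picks up the fixed
// behavior; an old query replays the value recorded when its checkpoint
// was created on a pre-fix version.
object BackwardCompatFlagSketch {
  val confKey = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

  def resolveFlag(offsetLogConfs: Map[String, String]): Boolean =
    offsetLogConfs.get(confKey) match {
      case Some(recorded) => recorded.toBoolean // old query: honor the checkpoint
      case None           => true               // new query: illustrative default
    }
}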

That said, I don't see a risk in removing the config "at any point". (I'd
even say removing this config in Spark 3.5.5 would not change anything.
The only reason I'm not removing the config in 3.5 (and not yet in
4.0/master) is simply to be conservative and address any remaining
concern.)

I think you are worrying about case 1 from my comment. In my new change
(link <https://github.com/apache/spark/pull/49983>), I added migration
logic for the case where the offset log contains the problematic
configuration: we take the value but store it under the new config, so at
the next microbatch planning the offset log will contain the new
configuration going forward. This addresses case 1, as long as we retain
the migration logic for a couple of minor releases (say, until 4.2 or so).
We only need to keep the migration logic around long enough that no one
would realistically upgrade directly from Spark 3.5.4 to a version that no
longer has it.
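
Roughly, the migration boils down to something like the following Scala
sketch (my own simplification, with a plain Map standing in for the offset
log metadata; see the linked PR for the real change):

// Hypothetical sketch of the offset log migration: if the problematic
// (old) key is present, carry its value over to the new key, so that
// subsequent microbatches only ever write the new key going forward.
object OffsetLogConfMigrationSketch {
  val oldKey = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
  val newKey = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

  def migrate(confs: Map[String, String]): Map[String, String] =
    confs.get(oldKey) match {
      case Some(value) => (confs - oldKey) + (newKey -> value)
      case None        => confs
    }
}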

Hope this helps address your concern.


On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com>
wrote:

>
> Having breaking changes in a minor release does not seem that good. As
> I'm reading this,
>
> "*This could break the query if the rule impacts the query, because the
> effectiveness of the fix is flipped.*"
> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>
>
> What if we have this https://github.com/apache/spark/pull/48149 change in
> the branch and remove it only for version 4? That way we don't break
> anything.
>
>
>
>
> On Mon, Feb 17, 2025 at 23:03 Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> I'd like to highlight this discussion because this is more important and
>> tricky in a way.
>>
>> As already mentioned in the mailing list and PRs, there was an obvious
>> mistake which let an improper configuration name, `spark.databricks.*`,
>> slip through.
>>
>>
>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>>
>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>>
>> In fact, Apache Spark committers had been successfully preventing this
>> repetitive mistake pattern during the review stages until the following
>> backports slipped through in Apache Spark 3.5.4.
>>
>> https://github.com/apache/spark/pull/45649
>> https://github.com/apache/spark/pull/48149
>> https://github.com/apache/spark/pull/49121
>>
>> At the time of writing, `spark.databricks.*` has been removed from
>> `master` and `branch-4.0`, and a new Scalastyle rule was added to protect
>> the Apache Spark repository from future mistakes.
>>
>> SPARK-51172 Rename to
>> spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>> SPARK-51173 Add `configName` Scalastyle rule
>>
>> What I propose is to release Apache Spark 3.5.5 next week with the
>> deprecation, in order to make Apache Spark 4.0 free of the
>> `spark.databricks.*` configuration.
>>
>> Apache Spark 3.5.5 (February 2025, with a deprecation warning and an
>> alternative)
>> Apache Spark 4.0.0 (March 2025, without the `spark.databricks.*` config)
>>
>> In addition, I'd like to volunteer as a release manager of Apache Spark
>> 3.5.5
>> for a swift release. WDYT?
>>
>> FYI, `branch-3.5` has 37 patches currently.
>>
>> $ git log --oneline v3.5.4..HEAD | wc -l
>>       37
>>
>> Best Regards,
>> Dongjoon.
>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
