Re: Potential Impact of Hive Upgrades on Spark Tables

2024-05-01 Thread Mich Talebzadeh
It is important to consider potential impacts on Spark tables stored
in the Hive metastore during an "upgrade". Depending on the upgrade
path, the Hive metastore schema or SerDe behavior might change,
requiring adjustments in the Spark code or configurations. I mentioned
the need to test the Spark applications thoroughly after a Hive
upgrade, which will necessitate liaising with the Hive group, as you
are relying on their metadata.
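
For illustration only, a minimal sketch (spark-shell style; the metastore
version, jar path and table name below are placeholders, not anything from
this thread) of the kind of configuration adjustment this can mean: pinning
the Hive metastore client that Spark talks to, and re-checking a native
(USING-clause) table after the Hive side has been upgraded.

import org.apache.spark.sql.SparkSession

// Pin the metastore client explicitly so a Hive upgrade is adopted
// deliberately rather than implicitly (Spark 3.1+ config names).
val spark = SparkSession.builder()
  .appName("hive-metastore-check")
  .config("spark.sql.hive.metastore.version", "3.1.3")            // placeholder version
  .config("spark.sql.hive.metastore.jars", "path")                // or "builtin" / "maven"
  .config("spark.sql.hive.metastore.jars.path", "/opt/hive/lib/*") // placeholder path
  .enableHiveSupport()
  .getOrCreate()

// A native Spark data source table: its metadata still lives in the Hive
// metastore, but Spark's own Parquet source reads and writes the data.
spark.sql(
  """CREATE TABLE IF NOT EXISTS sales_native (id BIGINT, amount DOUBLE)
    |USING parquet""".stripMargin)

// Sanity check after the upgrade: the table should still resolve and its
// provider should still be what you expect.
spark.sql("DESCRIBE TABLE EXTENDED sales_native").show(truncate = false)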


Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

On Wed, 1 May 2024 at 04:30, Wenchen Fan  wrote:
>
> Yes, Spark has a shim layer to support all Hive versions. It shouldn't be an 
> issue as many users create native Spark data source tables already today, by 
> explicitly putting the `USING` clause in the CREATE TABLE statement.
>
> On Wed, May 1, 2024 at 12:56 AM Mich Talebzadeh  
> wrote:
>>
>> @Wenchen Fan Got your explanation, thanks!
>>
>> My understanding is that even if we create Spark tables using Spark's
>> native data sources, by default, the metadata about these tables will
>> be stored in the Hive metastore. As a consequence, a Hive upgrade can
>> potentially affect Spark tables. For example, depending on the
>> severity of the changes, the Hive metastore schema might change, which
>> could require Spark code to be updated to handle these changes in how
>> table metadata is represented. Is this assertion correct?
>>
>> Thanks
>>
>> Mich Talebzadeh,
>>
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".




Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Hey all,

Reviving this thread, but Spark master has already accumulated a huge
amount of changes.  As a downstream project maintainer, I want to really
start testing the new features and other breaking changes, and it's hard to
do that without a Preview release. So the sooner we make a Preview release,
the faster we can start getting feedback for fixing things for a great
Spark 4.0 final release.

So I urge the community to produce a Spark 4.0 Preview soon even if certain
features targeting the Delta 4.0 release are still incomplete.

Thanks!


On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:

> Thank you all for the replies!
>
> To @Nicholas Chammas  : Thanks for cleaning
> up the error terminology and documentation! I've merged the first PR and
> let's finish others before the 4.0 release.
> To @Dongjoon Hyun  : Thanks for driving the ANSI
> on by default effort! Now the vote has passed, let's flip the config and
> finish the DataFrame error context feature before 4.0.
> To @Jungtaek Lim  : Ack. We can treat the
> Streaming state store data source as completed for 4.0 then.
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my understanding
>> is that we want to release the feature to 4.0.0, but there are several
>> remaining works to be done. While the tentative timeline for releasing is
>> June 2024, what would be the tentative timeline for the RC cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>> and I think it's time to prepare for it and discuss the ongoing projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Wenchen Fan
Yea I think a preview release won't hurt (without a branch cut). We don't
need to wait for all the ongoing projects to be ready. How about we do a
4.0 preview release based on the current master branch next Monday?

On Wed, May 1, 2024 at 11:06 PM Tathagata Das 
wrote:

> Hey all,
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard to
> do that without a Preview release. So the sooner we make a Preview release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
> Thanks!
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>
>> Thank you all for the replies!
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>>> wrote:
>>> >
>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>> follow-up tickets, but we don't plan to address them soon. The current
>>> implementation is the final shape for Spark 4.0.0, unless there are demands
>>> on the follow-up tickets.
>>> >
>>> > We may want to check the plan for transformWithState - my
>>> understanding is that we want to release the feature to 4.0.0, but there
>>> are several remaining works to be done. While the tentative timeline for
>>> releasing is June 2024, what would be the tentative timeline for the RC cut?
>>> > (cc. Anish to add more context on the plan for transformWithState)
>>> >
>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>>> and I think it's time to prepare for it and discuss the ongoing projects:
>>> > • ANSI by default
>>> > • Spark Connect GA
>>> > • Structured Logging
>>> > • Streaming state store data source
>>> > • new data type VARIANT
>>> > • STRING collation support
>>> > • Spark k8s operator versioning
>>> > Please help to add more items to this list that are missed here. I
>>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>>> there is no objection. Thank you all for the great work that fills Spark
>>> 4.0!
>>> >
>>> > Wenchen Fan
>>>
>>>


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Santosh Pingale
Thanks Wenchen for starting this!

How do we define "the user" for spark?
1. End users: There are some users that use spark as a service from a
provider
2. Providers/Operators: There are some users that provide spark as a
service for their internal(on-prem setup with yarn/k8s)/external(Something
like EMR) customers
3. ?

Perhaps we need to consider infrastructure behavior changes as well to
accommodate the second group of users.

On 1 May 2024, at 06:08, Wenchen Fan  wrote:

Hi all,

It's exciting to see innovations keep happening in the Spark community and
Spark keeps evolving itself. To make these innovations available to more
users, it's important to help users upgrade to newer Spark versions easily.
We've done a good job on it: the PR template requires the author to write
down user-facing behavior changes, and the migration guide contains
behavior changes that need attention from users. Sometimes behavior changes
come with a legacy config to restore the old behavior. However, we still
lack a clear definition of behavior changes and I propose the following
definition:

Behavior changes mean user-visible functional changes in a new release via
public APIs. This means new features, and even bug fixes that eliminate NPE
or correct query results, are behavior changes. Things like performance
improvement, code refactoring, and changes to unreleased APIs/features are
not. All behavior changes should be called out in the PR description. We
need to write an item in the migration guide (and probably legacy config)
for those that may break users when upgrading:

   - Bug fixes that change query results. Users may need to do backfill to
   correct the existing data and must know about these correctness fixes.
   - Bug fixes that change query schema. Users may need to update the
   schema of the tables in their data pipelines and must know about these
   changes.
   - Remove configs
   - Rename error class/condition
   - Any change to the public Python/SQL/Scala/Java/R APIs: rename
   function, remove parameters, add parameters, rename parameters, change
   parameter default values, etc. These changes should be avoided in general,
   or do it in a compatible way like deprecating and adding a new function
   instead of renaming.
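
As an illustration of the last point above, a minimal Scala sketch (the
object and method names are hypothetical, not an actual Spark API):

// Deprecate the old name and add the new one alongside it, instead of renaming.
object DatasetUtils {
  @deprecated("Use collectAsSeq instead", "4.0.0")
  def collectAsList[T](values: Iterator[T]): List[T] = values.toList

  // Existing callers keep compiling; they only see a deprecation warning.
  def collectAsSeq[T](values: Iterator[T]): Seq[T] = values.toSeq
}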

Once we reach a conclusion, I'll document it in
https://spark.apache.org/versioning-policy.html .

Thanks,
Wenchen


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Next week sounds great! Thank you Wenchen!

On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:

> Yea I think a preview release won't hurt (without a branch cut). We don't
> need to wait for all the ongoing projects to be ready. How about we do a
> 4.0 preview release based on the current master branch next Monday?
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das 
> wrote:
>
>> Hey all,
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>> Thanks!
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>>> Thank you all for the replies!
>>>
>>> To @Nicholas Chammas  : Thanks for cleaning
>>> up the error terminology and documentation! I've merged the first PR and
>>> let's finish others before the 4.0 release.
>>> To @Dongjoon Hyun  : Thanks for driving the
>>> ANSI on by default effort! Now the vote has passed, let's flip the config
>>> and finish the DataFrame error context feature before 4.0.
>>> To @Jungtaek Lim  : Ack. We can treat the
>>> Streaming state store data source as completed for 4.0 then.
>>> To @Cheng Pan  : Yea we definitely should have a
>>> preview release. Let's collect more feedback on the ongoing projects and
>>> then we can propose a date for the preview release.
>>>
>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>
 will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?

 Thanks,
 Cheng Pan


 > On Apr 15, 2024, at 09:58, Jungtaek Lim 
 wrote:
 >
 > W.r.t. state data source - reader (SPARK-45511), there are several
 follow-up tickets, but we don't plan to address them soon. The current
 implementation is the final shape for Spark 4.0.0, unless there are demands
 on the follow-up tickets.
 >
 > We may want to check the plan for transformWithState - my
 understanding is that we want to release the feature to 4.0.0, but there
 are several remaining works to be done. While the tentative timeline for
 releasing is June 2024, what would be the tentative timeline for the RC 
 cut?
 > (cc. Anish to add more context on the plan for transformWithState)
 >
 > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
 wrote:
 > Hi all,
 >
 > It's close to the previously proposed 4.0.0 release date (June 2024),
 and I think it's time to prepare for it and discuss the ongoing projects:
 > • ANSI by default
 > • Spark Connect GA
 > • Structured Logging
 > • Streaming state store data source
 > • new data type VARIANT
 > • STRING collation support
 > • Spark k8s operator versioning
 > Please help to add more items to this list that are missed here. I
 would like to volunteer as the release manager for Apache Spark 4.0.0 if
 there is no objection. Thank you all for the great work that fills Spark
 4.0!
 >
 > Wenchen Fan




Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Good point, Santosh!

I was originally targeting end users who write queries with Spark, as this
is probably the largest user base. But we should definitely consider other
users who deploy and manage Spark clusters. Those users are usually more
tolerant of behavior changes and I think it should be sufficient to put
behavior changes in this area in the release notes.

On Wed, May 1, 2024 at 11:18 PM Santosh Pingale 
wrote:

> Thanks Wenchen for starting this!
>
> How do we define "the user" for spark?
> 1. End users: There are some users that use spark as a service from a
> provider
> 2. Providers/Operators: There are some users that provide spark as a
> service for their internal(on-prem setup with yarn/k8s)/external(Something
> like EMR) customers
> 3. ?
>
> Perhaps we need to consider infrastructure behavior changes as well to
> accommodate the second group of users.
>
> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>
> Hi all,
>
> It's exciting to see innovations keep happening in the Spark community and
> Spark keeps evolving itself. To make these innovations available to more
> users, it's important to help users upgrade to newer Spark versions easily.
> We've done a good job on it: the PR template requires the author to write
> down user-facing behavior changes, and the migration guide contains
> behavior changes that need attention from users. Sometimes behavior changes
> come with a legacy config to restore the old behavior. However, we still
> lack a clear definition of behavior changes and I propose the following
> definition:
>
> Behavior changes mean user-visible functional changes in a new release via
> public APIs. This means new features, and even bug fixes that eliminate NPE
> or correct query results, are behavior changes. Things like performance
> improvement, code refactoring, and changes to unreleased APIs/features are
> not. All behavior changes should be called out in the PR description. We
> need to write an item in the migration guide (and probably legacy config)
> for those that may break users when upgrading:
>
>- Bug fixes that change query results. Users may need to do backfill
>to correct the existing data and must know about these correctness fixes.
>- Bug fixes that change query schema. Users may need to update the
>schema of the tables in their data pipelines and must know about these
>changes.
>- Remove configs
>- Rename error class/condition
>- Any change to the public Python/SQL/Scala/Java/R APIs: rename
>function, remove parameters, add parameters, rename parameters, change
>parameter default values, etc. These changes should be avoided in general,
>or do it in a compatible way like deprecating and adding a new function
>instead of renaming.
>
> Once we reach a conclusion, I'll document it in
> https://spark.apache.org/versioning-policy.html .
>
> Thanks,
> Wenchen
>
>
>


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Erik Krogen
Thanks for raising this important discussion Wenchen! Two points I would
like to raise, though I'm fully supportive of any improvements in this
regard, my points below notwithstanding -- I am not intending to let
perfect be the enemy of good here.

On a similar note as Santosh's comment, we should consider how this relates
to developer APIs. Let's say I am an end user relying on some library like
frameless, which relies on
developer APIs in Spark. When we make a change to Spark's developer APIs
that requires a corresponding change in frameless, I don't directly see
that change as an end user, but it *does* impact me, because now I have to
upgrade to a new version of frameless that supports those new changes. This
can have ripple effects across the ecosystem. Should we call out such
changes so that end users understand the potential impact to libraries they
use?

Second point, what about binary compatibility? Currently our versioning
policy says "Link-level compatibility is something we’ll try to guarantee
in future releases." (FWIW, it has said this since at least 2016
...)
One step towards this would be to clearly call out any binary-incompatible
changes in our release notes, to help users understand if they may be
impacted. Similar to my first point, this has ripple effects across the
ecosystem -- if I just use Spark itself, recompiling is probably not a big
deal, but if I use N libraries that each depend on Spark, then after a
binary-incompatible change is made I have to wait for all N libraries to
publish new compatible versions before I can upgrade myself, presenting a
nontrivial barrier to adoption.
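
To make the binary-compatibility point concrete, here is a small sketch
(hypothetical code, not Spark's) of a change that is source-compatible but
binary-incompatible, exactly the kind of change worth calling out in release
notes. Tools such as MiMa can flag these at build time.

// Version 1 of a hypothetical library API:
object ReaderV1 {
  def read(path: String): String = s"reading $path"
}

// Version 2 adds a parameter with a default value. Source calling
// read("x") still compiles unchanged, but the compiled method is now
// read(String, Map) plus a synthetic default getter, so a jar built against
// the old read(String) signature fails at runtime with NoSuchMethodError if
// v2 replaces v1 under the same name.
object ReaderV2 {
  def read(path: String, options: Map[String, String] = Map.empty): String =
    s"reading $path with ${options.size} option(s)"
}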

On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
 wrote:

> Thanks Wenchen for starting this!
>
> How do we define "the user" for spark?
> 1. End users: There are some users that use spark as a service from a
> provider
> 2. Providers/Operators: There are some users that provide spark as a
> service for their internal(on-prem setup with yarn/k8s)/external(Something
> like EMR) customers
> 3. ?
>
> Perhaps we need to consider infrastructure behavior changes as well to
> accommodate the second group of users.
>
> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>
> Hi all,
>
> It's exciting to see innovations keep happening in the Spark community and
> Spark keeps evolving itself. To make these innovations available to more
> users, it's important to help users upgrade to newer Spark versions easily.
> We've done a good job on it: the PR template requires the author to write
> down user-facing behavior changes, and the migration guide contains
> behavior changes that need attention from users. Sometimes behavior changes
> come with a legacy config to restore the old behavior. However, we still
> lack a clear definition of behavior changes and I propose the following
> definition:
>
> Behavior changes mean user-visible functional changes in a new release via
> public APIs. This means new features, and even bug fixes that eliminate NPE
> or correct query results, are behavior changes. Things like performance
> improvement, code refactoring, and changes to unreleased APIs/features are
> not. All behavior changes should be called out in the PR description. We
> need to write an item in the migration guide (and probably legacy config)
> for those that may break users when upgrading:
>
>- Bug fixes that change query results. Users may need to do backfill
>to correct the existing data and must know about these correctness fixes.
>- Bug fixes that change query schema. Users may need to update the
>schema of the tables in their data pipelines and must know about these
>changes.
>- Remove configs
>- Rename error class/condition
>- Any change to the public Python/SQL/Scala/Java/R APIs: rename
>function, remove parameters, add parameters, rename parameters, change
>parameter default values, etc. These changes should be avoided in general,
>or do it in a compatible way like deprecating and adding a new function
>instead of renaming.
>
> Once we reach a conclusion, I'll document it in
> https://spark.apache.org/versioning-policy.html .
>
> Thanks,
> Wenchen
>
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Dongjoon Hyun
+1 for next Monday.

Dongjoon.

On Wed, May 1, 2024 at 8:46 AM Tathagata Das 
wrote:

> Next week sounds great! Thank you Wenchen!
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
 Thank you all for the replies!

 To @Nicholas Chammas  : Thanks for
 cleaning up the error terminology and documentation! I've merged the first
 PR and let's finish others before the 4.0 release.
 To @Dongjoon Hyun  : Thanks for driving the
 ANSI on by default effort! Now the vote has passed, let's flip the config
 and finish the DataFrame error context feature before 4.0.
 To @Jungtaek Lim  : Ack. We can treat
 the Streaming state store data source as completed for 4.0 then.
 To @Cheng Pan  : Yea we definitely should have a
 preview release. Let's collect more feedback on the ongoing projects and
 then we can propose a date for the preview release.

 On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:

> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there are 
> demands
> on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my
> understanding is that we want to release the feature to 4.0.0, but there
> are several remaining works to be done. While the tentative timeline for
> releasing is June 2024, what would be the tentative timeline for the RC 
> cut?
> > (cc. Anish to add more context on the plan for transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
> wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June
> 2024), and I think it's time to prepare for it and discuss the ongoing
> projects:
> > • ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here. I
> would like to volunteer as the release manager for Apache Spark 4.0.0 if
> there is no objection. Thank you all for the great work that fills Spark
> 4.0!
> >
> > Wenchen Fan
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Xiao Li
+1 for next Monday.

We can do more previews when the other features are ready for preview.

Tathagata Das  wrote on Wed, May 1, 2024 at 08:46:

> Next week sounds great! Thank you Wenchen!
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
 Thank you all for the replies!

 To @Nicholas Chammas  : Thanks for
 cleaning up the error terminology and documentation! I've merged the first
 PR and let's finish others before the 4.0 release.
 To @Dongjoon Hyun  : Thanks for driving the
 ANSI on by default effort! Now the vote has passed, let's flip the config
 and finish the DataFrame error context feature before 4.0.
 To @Jungtaek Lim  : Ack. We can treat
 the Streaming state store data source as completed for 4.0 then.
 To @Cheng Pan  : Yea we definitely should have a
 preview release. Let's collect more feedback on the ongoing projects and
 then we can propose a date for the preview release.

 On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:

> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there are 
> demands
> on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my
> understanding is that we want to release the feature to 4.0.0, but there
> are several remaining works to be done. While the tentative timeline for
> releasing is June 2024, what would be the tentative timeline for the RC 
> cut?
> > (cc. Anish to add more context on the plan for transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
> wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June
> 2024), and I think it's time to prepare for it and discuss the ongoing
> projects:
> > • ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here. I
> would like to volunteer as the release manager for Apache Spark 4.0.0 if
> there is no objection. Thank you all for the great work that fills Spark
> 4.0!
> >
> > Wenchen Fan
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Hyukjin Kwon
SGTM

On Thu, 2 May 2024 at 02:06, Dongjoon Hyun  wrote:

> +1 for next Monday.
>
> Dongjoon.
>
> On Wed, May 1, 2024 at 8:46 AM Tathagata Das 
> wrote:
>
>> Next week sounds great! Thank you Wenchen!
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Hey all,

 Reviving this thread, but Spark master has already accumulated a huge
 amount of changes.  As a downstream project maintainer, I want to really
 start testing the new features and other breaking changes, and it's hard to
 do that without a Preview release. So the sooner we make a Preview release,
 the faster we can start getting feedback for fixing things for a great
 Spark 4.0 final release.

 So I urge the community to produce a Spark 4.0 Preview soon even if
 certain features targeting the Delta 4.0 release are still incomplete.

 Thanks!


 On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
 wrote:

> Thank you all for the replies!
>
> To @Nicholas Chammas  : Thanks for
> cleaning up the error terminology and documentation! I've merged the first
> PR and let's finish others before the 4.0 release.
> To @Dongjoon Hyun  : Thanks for driving the
> ANSI on by default effort! Now the vote has passed, let's flip the config
> and finish the DataFrame error context feature before 4.0.
> To @Jungtaek Lim  : Ack. We can treat
> the Streaming state store data source as completed for 4.0 then.
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>> 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are 
>> demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my
>> understanding is that we want to release the feature to 4.0.0, but there
>> are several remaining works to be done. While the tentative timeline for
>> releasing is June 2024, what would be the tentative timeline for the RC 
>> cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June
>> 2024), and I think it's time to prepare for it and discuss the ongoing
>> projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Chao Sun
+1

On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:

> +1 for next Monday.
>
> We can do more previews when the other features are ready for preview.
>
> Tathagata Das  wrote on Wed, May 1, 2024 at 08:46:
>
>> Next week sounds great! Thank you Wenchen!
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Hey all,

 Reviving this thread, but Spark master has already accumulated a huge
 amount of changes.  As a downstream project maintainer, I want to really
 start testing the new features and other breaking changes, and it's hard to
 do that without a Preview release. So the sooner we make a Preview release,
 the faster we can start getting feedback for fixing things for a great
 Spark 4.0 final release.

 So I urge the community to produce a Spark 4.0 Preview soon even if
 certain features targeting the Delta 4.0 release are still incomplete.

 Thanks!


 On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
 wrote:

> Thank you all for the replies!
>
> To @Nicholas Chammas  : Thanks for
> cleaning up the error terminology and documentation! I've merged the first
> PR and let's finish others before the 4.0 release.
> To @Dongjoon Hyun  : Thanks for driving the
> ANSI on by default effort! Now the vote has passed, let's flip the config
> and finish the DataFrame error context feature before 4.0.
> To @Jungtaek Lim  : Ack. We can treat
> the Streaming state store data source as completed for 4.0 then.
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>> 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are 
>> demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my
>> understanding is that we want to release the feature to 4.0.0, but there
>> are several remaining works to be done. While the tentative timeline for
>> releasing is June 2024, what would be the tentative timeline for the RC 
>> cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June
>> 2024), and I think it's time to prepare for it and discuss the ongoing
>> projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Hi Erik,

Thanks for sharing your thoughts! Note: developer APIs are also public APIs
(such as Data Source V2 API, Spark Listener API, etc.), so breaking changes
should be avoided as much as we can and new APIs should be mentioned in the
release notes. Breaking binary compatibility is also a "functional change"
and should be treated as a behavior change.
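
As a minimal sketch of what that means in practice (the listener below is
illustrative, not taken from any particular library), a downstream project
extending the Spark Listener API looks like this, so a signature change in
that API ripples out to such code:

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// If the event class or the callback signature changed, this code would
// have to be rebuilt (and possibly rewritten) against the new Spark version.
class AppEndLogger extends SparkListener {
  override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit =
    println(s"Application ended at ${event.time}")
}

// Registered from the driver, e.g.:
//   spark.sparkContext.addSparkListener(new AppEndLogger)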

BTW, AFAIK some downstream libraries use private APIs such as Catalyst
Expression and LogicalPlan. It's too much work to track all the changes to
private APIs and I think it's the downstream library's responsibility to
check such changes in new Spark versions, or avoid using private APIs.
Exceptions can happen if certain private APIs are used too widely and we
should avoid breaking them.

Thanks,
Wenchen

On Wed, May 1, 2024 at 11:51 PM Erik Krogen  wrote:

> Thanks for raising this important discussion Wenchen! Two points I would
> like to raise, though I'm fully supportive of any improvements in this
> regard, my points below notwithstanding -- I am not intending to let
> perfect be the enemy of good here.
>
> On a similar note as Santosh's comment, we should consider how this
> relates to developer APIs. Let's say I am an end user relying on some
> library like frameless , which
> relies on developer APIs in Spark. When we make a change to Spark's
> developer APIs that requires a corresponding change in frameless, I don't
> directly see that change as an end user, but it *does* impact me, because
> now I have to upgrade to a new version of frameless that supports those new
> changes. This can have ripple effects across the ecosystem. Should we call
> out such changes so that end users understand the potential impact to
> libraries they use?
>
> Second point, what about binary compatibility? Currently our versioning
> policy says "Link-level compatibility is something we’ll try to guarantee
> in future releases." (FWIW, it has said this since at least 2016
> ...)
> One step towards this would be to clearly call out any binary-incompatible
> changes in our release notes, to help users understand if they may be
> impacted. Similar to my first point, this has ripple effects across the
> ecosystem -- if I just use Spark itself, recompiling is probably not a big
> deal, but if I use N libraries that each depend on Spark, then after a
> binary-incompatible change is made I have to wait for all N libraries to
> publish new compatible versions before I can upgrade myself, presenting a
> nontrivial barrier to adoption.
>
> On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
>  wrote:
>
>> Thanks Wenchen for starting this!
>>
>> How do we define "the user" for spark?
>> 1. End users: There are some users that use spark as a service from a
>> provider
>> 2. Providers/Operators: There are some users that provide spark as a
>> service for their internal(on-prem setup with yarn/k8s)/external(Something
>> like EMR) customers
>> 3. ?
>>
>> Perhaps we need to consider infrastructure behavior changes as well to
>> accommodate the second group of users.
>>
>> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>>
>> Hi all,
>>
>> It's exciting to see innovations keep happening in the Spark community
>> and Spark keeps evolving itself. To make these innovations available to
>> more users, it's important to help users upgrade to newer Spark versions
>> easily. We've done a good job on it: the PR template requires the author to
>> write down user-facing behavior changes, and the migration guide contains
>> behavior changes that need attention from users. Sometimes behavior changes
>> come with a legacy config to restore the old behavior. However, we still
>> lack a clear definition of behavior changes and I propose the following
>> definition:
>>
>> Behavior changes mean user-visible functional changes in a new release
>> via public APIs. This means new features, and even bug fixes that eliminate
>> NPE or correct query results, are behavior changes. Things like performance
>> improvement, code refactoring, and changes to unreleased APIs/features are
>> not. All behavior changes should be called out in the PR description. We
>> need to write an item in the migration guide (and probably legacy config)
>> for those that may break users when upgrading:
>>
>>- Bug fixes that change query results. Users may need to do backfill
>>to correct the existing data and must know about these correctness fixes.
>>- Bug fixes that change query schema. Users may need to update the
>>schema of the tables in their data pipelines and must know about these
>>changes.
>>- Remove configs
>>- Rename error class/condition
>>- Any change to the public Python/SQL/Scala/Java/R APIs: rename
>>function, remove parameters, add parameters, rename parameters, change
>>parameter default values, etc. These changes should be avoided in general,
>>or do it in a compatible way like deprecating and adding a new function
>>instead of renaming.

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Holden Karau
+1 :) yay previews

On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:

> +1
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
>> +1 for next Monday.
>>
>> We can do more previews when the other features are ready for preview.
>>
>> Tathagata Das  wrote on Wed, May 1, 2024 at 08:46:
>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>>
 Yea I think a preview release won't hurt (without a branch cut). We
 don't need to wait for all the ongoing projects to be ready. How about we
 do a 4.0 preview release based on the current master branch next Monday?

 On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Hey all,
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard 
> to
> do that without a Preview release. So the sooner we make a Preview 
> release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
> Thanks!
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
> wrote:
>
>> Thank you all for the replies!
>>
>> To @Nicholas Chammas  : Thanks for
>> cleaning up the error terminology and documentation! I've merged the 
>> first
>> PR and let's finish others before the 4.0 release.
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>> To @Jungtaek Lim  : Ack. We can treat
>> the Streaming state store data source as completed for 4.0 then.
>> To @Cheng Pan  : Yea we definitely should have
>> a preview release. Let's collect more feedback on the ongoing projects 
>> and
>> then we can propose a date for the preview release.
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>>> 3.0.0?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >
>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>> follow-up tickets, but we don't plan to address them soon. The current
>>> implementation is the final shape for Spark 4.0.0, unless there are 
>>> demands
>>> on the follow-up tickets.
>>> >
>>> > We may want to check the plan for transformWithState - my
>>> understanding is that we want to release the feature to 4.0.0, but there
>>> are several remaining works to be done. While the tentative timeline for
>>> releasing is June 2024, what would be the tentative timeline for the RC 
>>> cut?
>>> > (cc. Anish to add more context on the plan for transformWithState)
>>> >
>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > It's close to the previously proposed 4.0.0 release date (June
>>> 2024), and I think it's time to prepare for it and discuss the ongoing
>>> projects:
>>> > • ANSI by default
>>> > • Spark Connect GA
>>> > • Structured Logging
>>> > • Streaming state store data source
>>> > • new data type VARIANT
>>> > • STRING collation support
>>> > • Spark k8s operator versioning
>>> > Please help to add more items to this list that are missed here. I
>>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>>> there is no objection. Thank you all for the great work that fills Spark
>>> 4.0!
>>> >
>>> > Wenchen Fan
>>>
>>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Jungtaek Lim
+1 love to see it!

On Thu, May 2, 2024 at 10:08 AM Holden Karau  wrote:

> +1 :) yay previews
>
> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>
>> +1
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>>> +1 for next Monday.
>>>
>>> We can do more previews when the other features are ready for preview.
>>>
>>> Tathagata Das  wrote on Wed, May 1, 2024 at 08:46:
>>>
 Next week sounds great! Thank you Wenchen!

 On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
 wrote:

> Yea I think a preview release won't hurt (without a branch cut). We
> don't need to wait for all the ongoing projects to be ready. How about we
> do a 4.0 preview release based on the current master branch next Monday?
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Hey all,
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard 
>> to
>> do that without a Preview release. So the sooner we make a Preview 
>> release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>> Thanks!
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
>> wrote:
>>
>>> Thank you all for the replies!
>>>
>>> To @Nicholas Chammas  : Thanks for
>>> cleaning up the error terminology and documentation! I've merged the 
>>> first
>>> PR and let's finish others before the 4.0 release.
>>> To @Dongjoon Hyun  : Thanks for driving
>>> the ANSI on by default effort! Now the vote has passed, let's flip the
>>> config and finish the DataFrame error context feature before 4.0.
>>> To @Jungtaek Lim  : Ack. We can treat
>>> the Streaming state store data source as completed for 4.0 then.
>>> To @Cheng Pan  : Yea we definitely should have
>>> a preview release. Let's collect more feedback on the ongoing projects 
>>> and
>>> then we can propose a date for the preview release.
>>>
>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>
 will we have preview release for 4.0.0 like we did for 2.0.0 and
 3.0.0?

 Thanks,
 Cheng Pan


 > On Apr 15, 2024, at 09:58, Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:
 >
 > W.r.t. state data source - reader (SPARK-45511), there are
 several follow-up tickets, but we don't plan to address them soon. The
 current implementation is the final shape for Spark 4.0.0, unless 
 there are
 demands on the follow-up tickets.
 >
 > We may want to check the plan for transformWithState - my
 understanding is that we want to release the feature to 4.0.0, but 
 there
 are several remaining works to be done. While the tentative timeline 
 for
 releasing is June 2024, what would be the tentative timeline for the 
 RC cut?
 > (cc. Anish to add more context on the plan for transformWithState)
 >
 > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
 wrote:
 > Hi all,
 >
 > It's close to the previously proposed 4.0.0 release date (June
 2024), and I think it's time to prepare for it and discuss the ongoing
 projects:
 > • ANSI by default
 > • Spark Connect GA
 > • Structured Logging
 > • Streaming state store data source
 > • new data type VARIANT
 > • STRING collation support
 > • Spark k8s operator versioning
 > Please help to add more items to this list that are missed here.
 I would like to volunteer as the release manager for Apache Spark 
 4.0.0 if
 there is no objection. Thank you all for the great work that fills 
 Spark
 4.0!
 >
 > Wenchen Fan


>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>