Re: Breaking API changes in Spark 3.0

2020-02-19 Thread Dongjoon Hyun
Hi, Karen.

Are you saying that Spark 3 has to have all deprecated 2.x APIs?
Could you tell us what your criteria are for `unnecessarily` or
`necessarily`?

> the migration process from Spark 2 to Spark 3 unnecessarily painful.

Bests,
Dongjoon.


On Tue, Feb 18, 2020 at 4:55 PM Karen Feng 
wrote:

> Hi all,
>
> I am concerned that the API-breaking changes in SPARK-25908 (as well as
> SPARK-16775, and potentially others) will make the migration process from
> Spark 2 to Spark 3 unnecessarily painful. For example, the removal of
> SQLContext.getOrCreate will break a large number of libraries currently
> built on Spark 2.
>
> Even if library developers do not use deprecated APIs, API changes between
> 2.x and 3.x will result in inconsistencies that require hacking around. For
> a fairly small and new (2.4.3+) genomics library, I had to create a number
> of shims (https://github.com/projectglow/glow/pull/155) for the source and
> test code due to API changes in SPARK-25393, SPARK-27328, SPARK-28744.
>
> It would be best practice to avoid breaking existing APIs to ease library
> development. To avoid dealing with similar deprecated API issues down the
> road, we should practice more prudence when considering new API proposals.
>
> I'd love to see more discussion on this.


Re: Breaking API changes in Spark 3.0

2020-02-19 Thread Holden Karau
So my understanding would be that, to provide a reasonable migration path,
we’d want the replacement for a deprecated API to also exist in 2.4; that
way, libraries and programs can dual-target during the migration process.

Now that isn’t always going to be doable, but it is certainly worth looking
at the situations where we aren’t providing a smooth migration path and
making sure the break is the best thing to do.

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-19 Thread Dongjoon Hyun
Yes, right. We need to choose the best approach case by case.
The following kind of runtime warning also sounds reasonable to me.

> Maybe we should only log the warning once per Spark application.
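
For illustration, a minimal sketch of such a once-per-application warning,
assuming a plain AtomicBoolean guard and slf4j logging (not the actual
Spark implementation):

    import java.util.concurrent.atomic.AtomicBoolean
    import org.slf4j.LoggerFactory

    object TrimDeprecationWarning {
      private val log = LoggerFactory.getLogger(getClass)
      private val warned = new AtomicBoolean(false)

      // Logs the deprecation warning at most once per JVM/application.
      def warnOnce(): Unit =
        if (warned.compareAndSet(false, true)) {
          log.warn("Two-parameter TRIM/LTRIM/RTRIM is deprecated; " +
            "use TRIM(trimStr FROM str) instead.")
        }
    }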

Bests,
Dongjoon.

On Tue, Feb 18, 2020 at 1:13 AM Wenchen Fan  wrote:

> I don't know the best way to deprecate a SQL function. A runtime warning
> can be annoying if it keeps appearing. Maybe we should only log the
> warning once per Spark application.
>
> On Tue, Feb 18, 2020 at 3:45 PM Dongjoon Hyun 
> wrote:
>
>> Thank you for feedback, Wenchen, Maxim, and Takeshi.
>>
>> 1. "we would also promote the SQL standard TRIM syntax"
>> 2. "we could output a warning with a notice about the order of
>> parameters"
>> 3. "it looks nice to make these (two parameters) trim functions
>> unrecommended in future releases"
>>
>> Yes, in case of reverting SPARK-28093, we had better deprecate these
>> two-parameter SQL function invocations, because this form is really
>> esoteric and even inconsistent inside Spark (the Scala API also follows
>> the PostgreSQL/Presto style, like the following):
>>
>> def trim(e: Column, trimString: String)
>>
>> If we keep this situation in the 3.0.0 release (a major release), it
>> means Apache Spark will carry it forever. And it will only get worse,
>> because it leads more users to fall into this trap.
>>
>> There are a few ways to deprecate. I believe (3) is the proper next
>> step in case of reverting, because (2) is infeasible and (1) is
>> insufficient: it is as silent as SPARK-28093 itself was. We need a
>> non-silent (noisy) option in this case. Technically, (3) can be done in
>> `Analyzer.ResolveFunctions`.
>>
>> 1. Documentation-only: remove the examples and add a migration guide.
>> 2. Compile-time warning by annotation: not an option for a SQL function
>> in a SQL string.
>> 3. Runtime warning with a directional guide:
>>    log.warn("... USE TRIM(trimStr FROM str) INSTEAD")
>>
>> What do you think about (3)?
>>
>> Bests,
>> Dongjoon.
>>
>> On Sun, Feb 16, 2020 at 1:22 AM Takeshi Yamamuro 
>> wrote:
>>
>>> The revert looks fine to me for keeping compatibility.
>>> Also, I think the different orders between systems easily lead to
>>> mistakes, so, as Wenchen suggested, it looks nice to make these
>>> two-parameter trim functions unrecommended in future releases:
>>> https://github.com/apache/spark/pull/27540#discussion_r377682518
>>> Actually, I think the SQL TRIM syntax is enough for trim use cases...
>>>
>>> Bests,
>>> Takeshi
>>>
>>>
>>> On Sun, Feb 16, 2020 at 3:02 AM Maxim Gekk 
>>> wrote:
>>>
 Also, if we look at the possible combinations of trim parameters:
 1. foldable srcStr + foldable trimStr
 2. foldable srcStr + non-foldable trimStr
 3. non-foldable srcStr + foldable trimStr
 4. non-foldable srcStr + non-foldable trimStr

 Case #2 seems rare, and #3 is probably the most common. Whenever we see
 case #2, we could output a warning with a notice about the order of the
 parameters; a sketch of such a check follows.
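
 For illustration, a minimal sketch of detecting case #2 on Catalyst
 expressions (`Expression.foldable` is real; the helper function and its
 heuristic are invented for the example):

     import org.apache.spark.sql.catalyst.expressions.Expression

     // A foldable srcStr paired with a non-foldable trimStr suggests the
     // caller swapped the arguments, since trim strings are usually
     // literals while source strings are usually columns.
     def likelySwappedTrimArgs(srcStr: Expression, trimStr: Expression): Boolean =
       srcStr.foldable && !trimStr.foldable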

 Maxim Gekk

 Software Engineer

 Databricks, Inc.


 On Sat, Feb 15, 2020 at 5:04 PM Wenchen Fan 
 wrote:

> It's unfortunate that we don't have a clear document about breaking
> changes (I'm working on one, BTW). I believe the general guidance is:
> *avoid breaking changes unless we have to*. For example, the previous
> result was so broken that we had to fix it; moving to Scala 2.12 forced
> us to break some APIs; etc.
>
> For this particular case, do we have to switch the parameter order? It's
> different from some systems, and the order was not decided explicitly,
> but I don't think those are strong reasons. People from RDBMS should use
> the SQL standard TRIM syntax more often. People using prior Spark
> versions should have figured out the parameter order of Spark TRIM
> (there was no documentation) and adjusted their queries. There is no
> standard that defines the parameter order of the TRIM function.
>
> In the long term, we would also promote the SQL standard TRIM syntax. I
> don't see enough benefit in "fixing" the parameter order to justify a
> breaking change.
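>
> For illustration, the standard syntax makes the trim string's role
> explicit, while the two-parameter form leaves the order ambiguous (a
> sketch; assumes an active `spark` session):
>
>   spark.sql("SELECT trim('x', 'xxabcxx')")          // which argument is trimmed?
>   spark.sql("SELECT trim(BOTH 'x' FROM 'xxabcxx')") // SQL standard: returns 'abc'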
>
> Thanks,
> Wenchen
>
>
>
> On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun 
> wrote:
>
>> Please note that the context is TRIM/LTRIM/RTRIM with two parameters
>> and the TRIM(trimStr FROM str) syntax.
>>
>> This thread is irrelevant to the one-parameter TRIM/LTRIM/RTRIM.
>>
>> On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> Hi, All.
>>>
>>> I'm sending this email because the Apache Spark committers had
>>> better have a consistent point of view for the upcoming PRs. And the
>>> community policy is the way to lead the community members transparently and

Re: Breaking API changes in Spark 3.0

2020-02-19 Thread Xiao Li
As in https://github.com/apache/spark/pull/23131, we added back unionAll.
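
For reference, a minimal sketch of what that PR restores (not the exact
source; see the PR itself):

    def unionAll(other: Dataset[T]): Dataset[T] = union(other)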

We might need to double-check whether we removed any widely used APIs in
this release before the RC. If the maintenance costs are small, keeping
some deprecated APIs looks reasonable to me. It can help the adoption of
Spark 3.0. We need to discuss the APIs case by case.

Xiao



Re: Breaking API changes in Spark 3.0

2020-02-19 Thread Dongjoon Hyun
Sure. I understand the background of the following requests, so it's a
good time to decide on criteria and start the discussion.

1. "to provide a reasonable migration path we’d want the replacement of
the deprecated API to also exist in 2.4"
2. "We need to discuss the APIs case by case"

For now, it's unclear what counts as "unnecessarily painful", what
qualifies as "widely used APIs", or how small "small maintenance costs"
must be.

I'm wondering whether the goal of Apache Spark 3.0.0 is to be 100%
backward compatible with Apache Spark 2.4.5, as Apache Kafka is.
Are we going to revert all changes? If there were clear criteria, we
wouldn't have needed to do the cleanup over the long 3.0.0 development
period.

BTW, to be clear, we are talking about 2.4.5 and 3.0.0 compatibility in
this thread.

Bests,
Dongjoon.




Re: Breaking API changes in Spark 3.0

2020-02-19 Thread Jungtaek Lim
Apache Spark 2.0 was released in July 2016. Assuming the project has been
trying its best to follow semantic versioning, users have waited "more
than three years" for these breaking changes. Any necessary breaking
changes the community fails to address now will become technical debt for
another 3+ years.

Since the PRs removing deprecated APIs were pointed out first, I'm not
sure what the concern is. As I recall, those PRs remove APIs that were
deprecated a couple of minor versions ago. If so, what's the problem?

If the deprecation messages don't clearly point to alternatives, that is
a major problem the community should be concerned about and try to fix,
but it's a separate one. The community doesn't deprecate an API just for
fun. Every deprecation has a reason, and not removing the API doesn't
make sense unless the deprecation itself was a mistake.

If the community really wants to build some (soft) rules/policies on
deprecation, I can only imagine two items:

1. Define a "minimum releases to live" period (per deprecated API, or
globally).
2. Never skip describing the reason for the deprecation, and try your
best to describe an alternative that works the same or similarly; if the
alternative doesn't behave exactly the same, also describe the difference
(optionally, maybe). A sketch of such a message follows below.

I can't imagine any other problems with deprecation.
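
For illustration, a minimal sketch of item 2, assuming a hypothetical API
(the names, message, and version are invented for the example):

    object LegacyApi {
      // The message names the replacement, the reason, and the
      // behavioral difference, so callers can migrate without guessing.
      @deprecated(
        "Use `countLong` instead; unlike `count`, it returns a Long, so " +
        "results above Int.MaxValue no longer overflow.",
        "3.0.0")
      def count(xs: Seq[Int]): Int = xs.size

      def countLong(xs: Seq[Int]): Long = xs.size.toLong
    }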


Re: Breaking API changes in Spark 3.0

2020-02-19 Thread Jungtaek Lim
I think I read too hastily and focused only on the first sentence of
Karen's input. Sorry about that.

As I said, I'm not sure I can agree with the point about deprecation and
breaking API changes, but the thread raises another topic that seems like
good input: our practice for new API proposals. I feel that should be a
separate thread to discuss, though.

Maybe we can make API deprecation a "heavyweight" operation to mitigate
the impact a bit, like requiring a discussion thread to reach consensus
before going through a PR. For now, you have no idea which API is going
to be deprecated, or why, if you only subscribe to dev@. Even if you
subscribe to issues@, you would miss it among the flood of issues.

Personally, I feel the root cause is that dev@ is very quiet compared to
the volume of PRs the community receives and the impact those PRs have.
I agree we should strike a balance here to avoid restricting ourselves
too much, but I feel there's no balance now: most things just go through
PRs without discussion. It would be ideal if we took the time to consider
this.

