Re: Request to document the direct relationship between other configurations
Thanks Jungtaek!

On Fri, Feb 14, 2020 at 3:57 PM, Jungtaek Lim wrote:
> OK, I agree this is going forward as we decided the final goal and it seems
> someone has started to work on it. In the meanwhile, we agree about
> documenting the direct relationship, and which style to use is "open".
>
> Thanks again for initiating the discussion thread - this thread led to the
> following thread for the final goal.
>
> On Fri, Feb 14, 2020 at 1:43 PM Hyukjin Kwon wrote:
>
>> It's okay to just follow one prevailing style. The main point I would
>> like to make is that we should *document* the direct relationship of
>> configurations.
>> For this case specifically, I don't think there is much point in deciding
>> one hard requirement to follow for the mid-term. We have the final goal
>> to reach.
>>
>> There are always many cases that require a committer's judgement. It is
>> impossible to put every single item into a guideline in practice.
>> So we usually trust their judgement unless there's an explicit objection.
>> Here, my explicit objection is about having no documentation for the
>> direct relationship between configurations. It seems people think we
>> should document this in general as well.
>>
>> In practice, I don't particularly mind what style is used. I vaguely
>> guess the style isn't an issue to many other committers in general.
>> If this is the case, let's open a separate thread to discuss and confirm
>> one style - this wasn't the main issue I wanted to address in this thread.
>>
>> Shall we conclude this thread by deciding to document the direct
>> relationship between configurations, preferably in one prevailing style?
>>
>> On Fri, Feb 14, 2020 at 11:36 AM, Jungtaek Lim wrote:
>>
>>> Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows a
>>> mix of 1 and 2-1, though 1 is even broken there.
>>>
>>> It doesn't matter if we're just discussing a one-time decision - it may
>>> be OK not to be strict about consistency, though it's not ideal. The
>>> thing is that these kinds of "preferences" create confusion in the
>>> review phase: reviewers provide different comments and try to "educate"
>>> contributors on their preferences. Expectations for such cases depend
>>> heavily on who the reviewers of the PR are - there's no value in that
>>> kind of education.
>>>
>>> The codebase is the reference for the implicit rules/policies which
>>> apply to all contributors, including newcomers. Let's just put our best
>>> efforts into being consistent in the codebase. (We should have consensus
>>> to do this anyway.)
>>>
>>> On Thu, Feb 13, 2020 at 12:44 PM Hyukjin Kwon wrote:

I think it's just fine as long as we're consistent with the instances having
the description, for instance: "When true and 'spark.xx.xx' is enabled, …"
I think this is 2-2 in most cases so far.

I think we can reference other configuration keys in another configuration's
documentation by using ADAPTIVE_EXECUTION_ENABLED.key as an example. So we
don't have a duplication problem in most cases.

Being consistent with the current codebase is the general guidance if I am
not mistaken. We have identified a problem and a good goal to reach. If we
want to change, let's do it as our final goal. Otherwise, let's simplify it
to reduce the overhead rather than having a policy for the mid-term
specifically.

On Thu, Feb 13, 2020 at 12:24 PM, Jungtaek Lim wrote:
> I tend to agree that there should be a time to make things consistent
> (and I'm very happy to see the new thread on discussion), and we may want
> to take some practice in the interim.
>
> But for me it is not clear what the practice in the interim is. I pointed
> out the problems of the existing style, and if we all agree the problems
> are valid then we may need to fix them before starting to use it widely.
>
> For me the major question is "where" to put it - there are at least two
> different approaches:
>
> 1) Put it in the description of `.enable` and describe the range of
> impact, e.g. put the description on "spark.A.enable" saying it will affect
> the following configurations under "spark.A".
> (I don't think it's good to enumerate all of the affected configs; as
> `spark.dynamicAllocation.enable` shows, that is fragile.)
>
> 2) Put it in the description of all affected configurations and describe
> which configuration is the prerequisite for this to be effective, e.g. put
> it on all configurations under `spark.dynamicAllocation`, mentioning that
> `spark.dynamicAllocation.enable` should be enabled for this to take effect.
>
> (I intended to skip mentioning "cross reference". Let's be pragmatic.)
>
> 2) also has two ways to describe it:
>
> 2-1) Just mention it simply - like "When dynamic allocation is enabled,",
> not pointing out the key to toggle. This hugely simplifies the description,
> though
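As a concrete illustration of the styles discussed in this thread, here is a
minimal sketch of what 2-1 and 2-2 could look like with Spark's internal
ConfigBuilder. The `spark.example.feature.*` keys and doc wording are invented
purely for illustration, not a decided convention; and since ConfigBuilder is
private[spark], a definition like this only compiles inside Spark's own source
tree (e.g. in the org.apache.spark.internal.config package object), which is
where such entries live.

// Illustrative sketch only: hypothetical keys and wording, not an agreed style.
package org.apache.spark.internal.config

import java.util.concurrent.TimeUnit

object ExampleDocStyles {

  val FEATURE_ENABLED = ConfigBuilder("spark.example.feature.enabled")
    .doc("Whether to enable the example feature.")
    .booleanConf
    .createWithDefault(false)

  // Style 2-1: describe the prerequisite in words, without naming the key.
  val FEATURE_TIMEOUT_STYLE_2_1 = ConfigBuilder("spark.example.feature.timeout")
    .doc("When the example feature is enabled, requests time out after this duration.")
    .timeConf(TimeUnit.SECONDS)
    .createWithDefaultString("60s")

  // Style 2-2: reference the prerequisite key directly (the
  // ADAPTIVE_EXECUTION_ENABLED.key pattern mentioned above), so the text
  // stays correct if the key is ever renamed.
  val FEATURE_MAX_RETRIES_STYLE_2_2 = ConfigBuilder("spark.example.feature.maxRetries")
    .doc(s"When ${FEATURE_ENABLED.key} is enabled, the maximum number of retries " +
      "before giving up.")
    .intConf
    .createWithDefault(3)
}

Either way, the point made above stands: the direct relationship is documented
on the dependent configuration itself.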
Re: SparkGraph review process
This will not be in Spark 3.0, no.

On Fri, Feb 14, 2020 at 1:12 AM kant kodali wrote:
>
> any update on this? Is spark graph going to make it into Spark or no?
>
> On Mon, Oct 14, 2019 at 12:26 PM Holden Karau wrote:
>>
>> Maybe let's ask the folks from Lightbend who helped with the previous
>> scala upgrade for their thoughts?
>>
>> On Mon, Oct 14, 2019 at 8:24 PM Xiao Li wrote:

1. On the technical side, my main concern is the runtime dependency on
org.opencypher:okapi-shade. okapi depends on several Scala libraries. We came
up with the solution to shade a few Scala libraries to avoid pollution.
However, I'm not super confident that the approach is sustainable, for two
reasons: a) there are no proper shading libraries for Scala, and b) we will
have to wait for upgrades from those Scala libraries before we can upgrade
Spark to use a newer Scala version. So it would be great if some Scala
experts could help review the current implementation and help assess the risk.

>>> This concern is valid. I think we should start the vote to ensure the
>>> whole community is aware of the risk and takes the responsibility to
>>> maintain this in the long term.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> On Fri, Oct 4, 2019 at 12:27 PM, Xiangrui Meng wrote:

Hi all,

I want to clarify my role first to avoid misunderstanding. I'm an individual
contributor here. My work on the graph SPIP, as well as other Spark features
I contributed to, is not associated with my employer.

It became quite challenging for me to keep track of the graph SPIP work due
to less available time at home. In retrospect, we should have involved more
Spark devs and committers early on so there is no single point of failure,
i.e., me. Hopefully it is not too late to fix that. I summarize my thoughts
here to help onboard other reviewers:

1. On the technical side, my main concern is the runtime dependency on
org.opencypher:okapi-shade. okapi depends on several Scala libraries. We came
up with the solution to shade a few Scala libraries to avoid pollution.
However, I'm not super confident that the approach is sustainable, for two
reasons: a) there are no proper shading libraries for Scala, and b) we will
have to wait for upgrades from those Scala libraries before we can upgrade
Spark to use a newer Scala version. So it would be great if some Scala
experts could help review the current implementation and help assess the risk.

2. Overloading helper methods. MLlib used to have several overloaded helper
methods for each algorithm, which later became a major maintenance burden.
Builders and setters/getters are more maintainable. I will comment again on
the PR.

3. The proposed API partitions the graph into sub-graphs, as described in the
property graph model. It is unclear to me how it would affect query
performance, because it requires the SQL optimizer to correctly recognize
data from the same source and make execution efficient.

4. The feature, although originally targeted for Spark 3.0, should not be a
Spark 3.0 release blocker because it doesn't require breaking changes. If we
miss the code freeze deadline, we can introduce a build flag to exclude the
module from the official release/distribution, and then make it the default
once the module is ready.

5. If unfortunately we still don't see sufficient committer reviews, I think
the best option would be submitting the work to Apache Incubator instead to
unblock the work. But maybe it is too early to discuss this option.

It would be great if other committers can offer help on the review! Really
appreciated!
Best,
Xiangrui

On Fri, Oct 4, 2019 at 1:32 AM Mats Rydberg wrote:
>
> Hello dear Spark community
>
> We are the developers behind the SparkGraph SPIP, which is a project
> created out of our work on openCypher Morpheus
> (https://github.com/opencypher/morpheus). During this year we have
> collaborated with mainly Xiangrui Meng of Databricks to define and develop
> a new SparkGraph module based on our experience from working on Morpheus.
> Morpheus - formerly known as "Cypher for Apache Spark" - has been in
> development for over 3 years and has matured in its API and implementation.
>
> The SPIP work has been on hold for a period of time now, as priorities at
> Databricks have changed, which has occupied Xiangrui's time (as well as
> other happenings). As you may know, the latest API PR
> (https://github.com/apache/spark/pull/24851) is blocking us from moving
> forward with the implementation.
>
> In an attempt to not lose track of this project we now reach out to you
> to ask whether there are any Spark co
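As an aside on point 2 in Xiangrui's summary above, here is a minimal generic
sketch of why builders/setters tend to age better than overloaded helpers.
The `GraphLoader` names are hypothetical and not SparkGraph's or MLlib's
actual API; the sketch only illustrates the maintenance trade-off.

// Overload style: every optional argument multiplies the number of signatures
// that must be kept source- and binary-compatible forever.
object GraphLoaderOverloads {
  def load(path: String): Unit = load(path, "csv")
  def load(path: String, format: String): Unit = load(path, format, inferSchema = true)
  def load(path: String, format: String, inferSchema: Boolean): Unit =
    println(s"loading $path as $format (inferSchema=$inferSchema)")
}

// Builder/setter style: one entry point; new options can be added without
// touching any existing signature.
class GraphLoader private (path: String, format: String, inferSchema: Boolean) {
  def withFormat(value: String): GraphLoader = new GraphLoader(path, value, inferSchema)
  def withInferSchema(value: Boolean): GraphLoader = new GraphLoader(path, format, value)
  def load(): Unit = println(s"loading $path as $format (inferSchema=$inferSchema)")
}

object GraphLoader {
  def apply(path: String): GraphLoader = new GraphLoader(path, "csv", inferSchema = true)
}

object BuilderExample extends App {
  GraphLoaderOverloads.load("graph.csv")                  // overload style
  GraphLoader("graph.csv").withInferSchema(false).load()  // builder style
}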
[DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
Hi, All.

I'm sending this email because the Apache Spark committers had better have a
consistent point of view for the upcoming PRs. And a community policy is the
way to lead the community members transparently and clearly for the long-term
good.

First of all, I want to emphasize that, like Apache Spark 2.0.0, Apache Spark
3.0.0 is going to achieve many improvements in SQL areas.

Especially, we invested lots of effort to improve SQL ANSI compliance and the
related test coverage for the future. One simple example is the following.

    [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax

With this support, we are able to move away from the esoteric Spark functions
`TRIM/LTRIM/RTRIM`.
(Note that the above syntax is ANSI standard, but we also have additional
non-standard forms of these functions.)

Here is the summary of the current status.

    +------------------------+-------------+---------------+
    | SQL Processing Engine  | TRIM Syntax | TRIM Function |
    +------------------------+-------------+---------------+
    | Spark 3.0.0-preview/2  | O           | O             |
    | PostgreSQL             | O           | O             |
    | Presto                 | O           | O             |
    | MySQL                  | O(*)        | X             |
    | Oracle                 | O           | X             |
    | MsSQL                  | O           | X             |
    | Teradata               | O           | X             |
    | Hive                   | X           | X             |
    | Spark 1.6 ~ 2.2        | X           | X             |
    | Spark 2.3 ~ 2.4        | X           | O(**)         |
    +------------------------+-------------+---------------+
    *  MySQL doesn't allow multiple trim characters.
    ** Spark 2.3 ~ 2.4 have the function with a different parameter order.

Here is an illustrative example of the problem.

    postgres=# SELECT trim('yxTomxx', 'xyz');
     btrim
    -------
     Tom

    presto:default> SELECT trim('yxTomxx', 'xyz');
     _col0
    -------
     Tom

    spark-sql> SELECT trim('yxTomxx', 'xyz');
    z

Here is our history of fixing the above issue.

    [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue
    1. https://github.com/apache/spark/pull/24902
       (Merged 2019-06-18 for v3.0.0; 3.0.0-preview and 3.0.0-preview2 released.)
    2. https://github.com/apache/spark/pull/24907
       (Merged 2019-06-20 for v2.3.4, but reverted)
    3. https://github.com/apache/spark/pull/24908
       (Merged 2019-06-21 for v2.4.4, but reverted)

(2) and (3) were reverted before the releases because we didn't want to fix
that in the maintenance releases. Please see the following references for the
decision.

    https://github.com/apache/spark/pull/24908#issuecomment-504799028 (2.3)
    https://github.com/apache/spark/pull/24907#issuecomment-504799021 (2.4)

Now, there are some requests to revert SPARK-28093 and to keep these esoteric
functions for backward compatibility, with the following reasoning.

> Reordering function parameters to match another system,
> for a method that is otherwise working correctly,
> sounds exactly like a cosmetic change to me.

> How can we silently change the parameter of an existing SQL function?
> I don't think this is a correctness issue as the SQL standard
> doesn't say that the function signature has to be trim(srcStr, trimStr).

The concern and the point of view make sense.

My concerns are the following.

    1. This kind of esoteric difference is what is negatively called
       `vendor lock-in`.
       - It's difficult for new users to understand.
       - It's hard to migrate between Apache Spark and the others.
    2. Although we did our best, Apache Spark SQL has not always been good
       enough.
    3. We need to make improvements in the future releases.

In short, we can either keep the 3.0.0-preview behavior here or revert
SPARK-28093 in order to keep this vendor lock-in behavior for backward
compatibility.

What I want is to build a consistent point of view in this category for the
upcoming PR reviews.

If we have a clear policy, we can save future community efforts in many ways.

Bests,
Dongjoon.
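For reference, here is a minimal sketch of the ANSI syntax discussed above,
assuming a local SparkSession on a build where SPARK-28126 is available
(3.0.0-preview or later, per the table). It avoids the two-argument form
whose parameter order differs across the versions listed; the object and app
names are invented for the example.

// Minimal sketch: ANSI TRIM(trimStr FROM str) instead of the two-argument form.
import org.apache.spark.sql.SparkSession

object TrimSyntaxExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("trim-syntax-example")
      .getOrCreate()

    // The trim characters are named explicitly before FROM, so there is no
    // ambiguity about argument order.
    spark.sql("SELECT trim('xyz' FROM 'yxTomxx') AS trimmed").show()
    // Expected (matching the PostgreSQL/Presto examples above): Tom

    spark.stop()
  }
}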
Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
Please note that the context is TRIM/LTRIM/RTRIM with two parameters and the
TRIM(trimStr FROM str) syntax. This thread is irrelevant to one-parameter
TRIM/LTRIM/RTRIM.

On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun wrote:
> Hi, All.
>
> I'm sending this email because the Apache Spark committers had better have
> a consistent point of view for the upcoming PRs. And a community policy is
> the way to lead the community members transparently and clearly for the
> long-term good.
>
> First of all, I want to emphasize that, like Apache Spark 2.0.0, Apache
> Spark 3.0.0 is going to achieve many improvements in SQL areas.
>
> Especially, we invested lots of effort to improve SQL ANSI compliance and
> the related test coverage for the future. One simple example is the
> following.
>
>     [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax
>
> With this support, we are able to move away from the esoteric Spark
> functions `TRIM/LTRIM/RTRIM`.
> (Note that the above syntax is ANSI standard, but we also have additional
> non-standard forms of these functions.)
>
> Here is the summary of the current status.
>
>     +------------------------+-------------+---------------+
>     | SQL Processing Engine  | TRIM Syntax | TRIM Function |
>     +------------------------+-------------+---------------+
>     | Spark 3.0.0-preview/2  | O           | O             |
>     | PostgreSQL             | O           | O             |
>     | Presto                 | O           | O             |
>     | MySQL                  | O(*)        | X             |
>     | Oracle                 | O           | X             |
>     | MsSQL                  | O           | X             |
>     | Teradata               | O           | X             |
>     | Hive                   | X           | X             |
>     | Spark 1.6 ~ 2.2        | X           | X             |
>     | Spark 2.3 ~ 2.4        | X           | O(**)         |
>     +------------------------+-------------+---------------+
>     *  MySQL doesn't allow multiple trim characters.
>     ** Spark 2.3 ~ 2.4 have the function with a different parameter order.
>
> Here is an illustrative example of the problem.
>
>     postgres=# SELECT trim('yxTomxx', 'xyz');
>      btrim
>     -------
>      Tom
>
>     presto:default> SELECT trim('yxTomxx', 'xyz');
>      _col0
>     -------
>      Tom
>
>     spark-sql> SELECT trim('yxTomxx', 'xyz');
>     z
>
> Here is our history of fixing the above issue.
>
>     [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue
>     1. https://github.com/apache/spark/pull/24902
>        (Merged 2019-06-18 for v3.0.0; 3.0.0-preview and 3.0.0-preview2 released.)
>     2. https://github.com/apache/spark/pull/24907
>        (Merged 2019-06-20 for v2.3.4, but reverted)
>     3. https://github.com/apache/spark/pull/24908
>        (Merged 2019-06-21 for v2.4.4, but reverted)
>
> (2) and (3) were reverted before the releases because we didn't want to fix
> that in the maintenance releases. Please see the following references for
> the decision.
>
>     https://github.com/apache/spark/pull/24908#issuecomment-504799028 (2.3)
>     https://github.com/apache/spark/pull/24907#issuecomment-504799021 (2.4)
>
> Now, there are some requests to revert SPARK-28093 and to keep these
> esoteric functions for backward compatibility, with the following reasoning.
>
> > Reordering function parameters to match another system,
> > for a method that is otherwise working correctly,
> > sounds exactly like a cosmetic change to me.
>
> > How can we silently change the parameter of an existing SQL function?
> > I don't think this is a correctness issue as the SQL standard
> > doesn't say that the function signature has to be trim(srcStr, trimStr).
>
> The concern and the point of view make sense.
>
> My concerns are the following.
>
>     1. This kind of esoteric difference is what is negatively called
>        `vendor lock-in`.
>        - It's difficult for new users to understand.
>        - It's hard to migrate between Apache Spark and the others.
>     2. Although we did our best, Apache Spark SQL has not always been good
>        enough.
>     3. We need to make improvements in the future releases.
>
> In short, we can either keep the 3.0.0-preview behavior here or revert
> SPARK-28093 in order to keep this vendor lock-in behavior for backward
> compatibility.
>
> What I want is to build a consistent point of view in this category for
> the upcoming PR reviews.
>
> If we have a clear policy, we can save future community efforts in many
> ways.
>
> Bests,
> Dongjoon.
>