Re: Adaptive Query Execution performance results in 3TB TPC-DS

2020-02-13 Thread Wenchen Fan
Thanks for providing the benchmark numbers! The result is very promising
and I'm looking forward to seeing more feedback from real-world workloads.

On Wed, Feb 12, 2020 at 3:43 PM Jia, Ke A  wrote:

> Hi all,
>
> We have completed the Spark 3.0 Adaptive Query Execution (AQE) performance
> tests in 3TB TPC-DS on a 5-node Cascade Lake cluster. With AQE, 2 queries
> show more than a 1.5x speedup and 37 queries show more than a 1.1x speedup.
> No query shows a significant performance degradation. The detailed
> performance results and key configurations are shown here.
> Based on the performance results, we recommend users turn on AQE in Spark
> 3.0. If you encounter any bugs or see room for improvement when enabling
> AQE, please file the related JIRAs. Thanks.
>
>
>
> Regards,
>
> Jia Ke
>
>
>
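For readers who want to try this out, enabling AQE on a Spark 3.0 session might look like the minimal sketch below. The spark.sql.adaptive.* keys are the standard Spark 3.0 AQE settings; the session setup and app name are illustrative and not taken from the benchmark.

  import org.apache.spark.sql.SparkSession

  // Minimal sketch: turn on AQE (and two of its runtime optimizations) when
  // building the session. The same keys can also be set in spark-defaults.conf
  // or via --conf on spark-submit.
  val spark = SparkSession.builder()
    .appName("aqe-example")
    .config("spark.sql.adaptive.enabled", "true")                    // master switch for AQE
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // coalesce shuffle partitions at runtime
    .config("spark.sql.adaptive.skewJoin.enabled", "true")           // split skewed join partitions at runtime
    .getOrCreate()

  // Or toggle it on an existing session:
  spark.conf.set("spark.sql.adaptive.enabled", "true")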


Re: Request to document the direct relationship between other configurations

2020-02-13 Thread Jungtaek Lim
Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows a
mix of 1 and 2-1, and 1 is even broken there.

If we were only discussing a one-time decision, it might be OK not to be
strict about consistency, though that's not ideal. The thing is that these
kinds of "preferences" create confusion during the review phase: reviewers
provide different comments and try to "educate" contributors according to
their own preferences. Expectations in such cases depend heavily on who the
reviewers of the PR are - there's no value in that kind of education.

The codebase is the reference for the implicit rules/policies that apply to
all contributors, including newcomers. Let's just put our best effort into
keeping the codebase consistent. (We need consensus to do this anyway.)


On Thu, Feb 13, 2020 at 12:44 PM Hyukjin Kwon  wrote:

> I think it’s just fine as long as we’re consistent with the instances
> having the description, for instance:
>
>   When true and ‘spark.xx.xx’ is enabled, …
>
> I think this is 2-2 in most cases so far. We can reference other
> configuration keys in another configuration's documentation by using
> ADAPTIVE_EXECUTION_ENABLED.key, for example. So we don't have a
> duplication problem in most cases.
>
> Being consistent with the current base is general guidance, if I am not
> mistaken. We have identified a problem and a good goal to reach.
> If we want to change, let's do it as our final goal. Otherwise, let's
> simplify it to reduce the overhead rather than having a policy specifically
> for the mid-term.
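For illustration, the cross-reference described above might look like the minimal sketch below, written against Spark's internal config builder. The SOME_FEATURE_ENABLED entry and its key are hypothetical; ADAPTIVE_EXECUTION_ENABLED is the existing SQLConf entry for spark.sql.adaptive.enabled.

  // Sketch only (inside org.apache.spark.sql.internal.SQLConf): a hypothetical
  // config whose description references another config key via .key, so the
  // documented key stays correct even if the referenced entry is renamed.
  val SOME_FEATURE_ENABLED =
    buildConf("spark.sql.adaptive.someFeature.enabled")   // hypothetical key
      .doc(s"When true, enables some adaptive feature. This is effective only " +
        s"when ${ADAPTIVE_EXECUTION_ENABLED.key} is set to true.")
      .booleanConf
      .createWithDefault(false)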
>
>
> On Thu, Feb 13, 2020 at 12:24 PM, Jungtaek Lim wrote:
>
>> I tend to agree that there should be a time to make things consistent
>> (and I'm very happy to see the new thread on discussion), and we may want
>> to adopt some practice in the interim.
>>
>> But to me it is not clear what the practice in the interim should be. I
>> pointed out the problems of the existing style, and if we all agree the
>> problems are valid then we may need to fix them before we start using the
>> style widely.
>>
>> For me the major question is "where" to put the information - there are at
>> least two different approaches:
>>
>> 1) Put it in the description of `.enable` and describe the range of
>> impact, e.g. state in the description of "spark.A.enable" that it affects
>> the configurations under "spark.A".
>> (I don't think it's good to enumerate all of the affected configs;
>> `spark.dynamicAllocation.enable` shows how fragile that is.)
>>
>> 2) Put it in the description of every affected configuration and describe
>> which configuration is the prerequisite for it to be effective, e.g. state
>> on all configurations under `spark.dynamicAllocation` that
>> `spark.dynamicAllocation.enable` must be enabled for them to take effect.
>>
>> (I intentionally skipped mentioning "cross reference". Let's be pragmatic.)
>>
>> 2) also has two ways to describe it:
>>
>> 2-1) Mention it simply - like "When dynamic allocation is enabled,",
>> without pointing out the key to toggle. This hugely simplifies the
>> description, though end users may have to guess to find the key to toggle.
>> (It'll be intuitive once we structure the configurations.)
>>
>> 2-2) Mention the key to toggle directly - like "This is effective only if
>> spark.A.enable is set to true.". It gets longer depending on how long the
>> configuration key is. Personally it feels verbose to me when the key to
>> toggle is in a spot we can already infer, but others may have different
>> opinions.
>>
>> I may be missing some details, so please chime in to add them. Otherwise
>> we may want to choose the best option and agree on a sample form of
>> description. (Once we reach that point, it may be OK to add it to the
>> contribution doc, as that is the thing we agree about.)
>>
>> Without the details it's going to be some sort of "preference", which
>> naturally invites disagreement, so it doesn't make sense to force someone
>> to do something that turns out to be a "preference".
>>
>> On Thu, Feb 13, 2020 at 11:10 AM Hyukjin Kwon 
>> wrote:
>>
>>> Adding that information is already the more prevalent style at this
>>> moment, and it is usual to follow the prevailing side if there isn't a
>>> specific reason not to.
>>> If there is confusion about this, I will explicitly add it to the
>>> guide (https://spark.apache.org/contributing.html). Let me know if this
>>> is still confusing or if you disagree.
>>>
>>> On Thu, Feb 13, 2020 at 9:47 AM, Hyukjin Kwon wrote:
>>>
 Yes, revisiting the configurations to make the documentation cleanly
 structured and deduplicated is probably our final goal. +1.

 One point I would like to add, though, is that we should keep adding such
 information to the documentation until we actually reach our final goal,
 since it will probably take a while to revisit/fix everything and make the
 documentation fully structured.
 If we go more conservatively, we can add such information to the new
 configurations being added from now on at 

Unsubscribe

2020-02-13 Thread Akhil Anil
-- 
Sent from Gmail Mobile


Unsubscribe

2020-02-13 Thread Akhil Anil
-- 
Sent from Gmail Mobile


Re: Request to document the direct relationship between other configurations

2020-02-13 Thread Hyukjin Kwon
It's okay to just follow one prevailing style. The main point I would like
to make is that we should *document* the direct relationship between
configurations.
For this case specifically, I don't think there is much point in deciding
on one hard requirement to follow for the mid-term. We have the final
goal to reach.

There are always many cases that require a committer's judgement. It is
impossible to put every single item into a guideline in practice.
So we usually trust their judgement unless there's an explicit
objection. Here, my explicit objection is about the lack of documentation
for the direct relationship between configurations. It seems people think
we should document this in general as well.

In practice, I don't particularly mind which style is used. I would
guess the style isn't an issue for many other committers in general.
If that is the case, let's open a separate thread to discuss and confirm one
style - this wasn't the main issue I wanted to address in this thread.

Shall we conclude this thread by deciding to document the direct
relationship between configurations, preferably in one prevailing style?


On Fri, Feb 14, 2020 at 11:36 AM, Jungtaek Lim wrote:

> Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows a
> mix of 1 and 2-1, and 1 is even broken there.
>
> If we were only discussing a one-time decision, it might be OK not to be
> strict about consistency, though that's not ideal. The thing is that these
> kinds of "preferences" create confusion during the review phase: reviewers
> provide different comments and try to "educate" contributors according to
> their own preferences. Expectations in such cases depend heavily on who the
> reviewers of the PR are - there's no value in that kind of education.
>
> The codebase is the reference for the implicit rules/policies that apply to
> all contributors, including newcomers. Let's just put our best effort into
> keeping the codebase consistent. (We need consensus to do this anyway.)
>
>
> On Thu, Feb 13, 2020 at 12:44 PM Hyukjin Kwon  wrote:
>
>> I think it’s just fine as long as we’re consistent with the instances
>> having the description, for instance:
>>
>>   When true and ‘spark.xx.xx’ is enabled, …
>>
>> I think this is 2-2 in most cases so far. We can reference other
>> configuration keys in another configuration's documentation by using
>> ADAPTIVE_EXECUTION_ENABLED.key, for example. So we don't have a
>> duplication problem in most cases.
>>
>> Being consistent with the current base is general guidance, if I am not
>> mistaken. We have identified a problem and a good goal to reach.
>> If we want to change, let's do it as our final goal. Otherwise, let's
>> simplify it to reduce the overhead rather than having a policy specifically
>> for the mid-term.
>>
>>
>> On Thu, Feb 13, 2020 at 12:24 PM, Jungtaek Lim wrote:
>>
>>> I tend to agree that there should be a time to make things consistent
>>> (and I'm very happy to see the new thread on discussion), and we may want
>>> to adopt some practice in the interim.
>>>
>>> But to me it is not clear what the practice in the interim should be. I
>>> pointed out the problems of the existing style, and if we all agree the
>>> problems are valid then we may need to fix them before we start using the
>>> style widely.
>>>
>>> For me the major question is "where" to put the information - there are
>>> at least two different approaches:
>>>
>>> 1) Put it in the description of `.enable` and describe the range of
>>> impact, e.g. state in the description of "spark.A.enable" that it affects
>>> the configurations under "spark.A".
>>> (I don't think it's good to enumerate all of the affected configs;
>>> `spark.dynamicAllocation.enable` shows how fragile that is.)
>>>
>>> 2) Put it in the description of every affected configuration and describe
>>> which configuration is the prerequisite for it to be effective, e.g. state
>>> on all configurations under `spark.dynamicAllocation` that
>>> `spark.dynamicAllocation.enable` must be enabled for them to take effect.
>>>
>>> (I intentionally skipped mentioning "cross reference". Let's be
>>> pragmatic.)
>>>
>>> 2) also has two ways to describe it:
>>>
>>> 2-1) Mention it simply - like "When dynamic allocation is enabled,",
>>> without pointing out the key to toggle. This hugely simplifies the
>>> description, though end users may have to guess to find the key to
>>> toggle. (It'll be intuitive once we structure the configurations.)
>>>
>>> 2-2) Mention the key to toggle directly - like "This is effective only
>>> if spark.A.enable is set to true.". It gets longer depending on how long
>>> the configuration key is. Personally it feels verbose to me when the key
>>> to toggle is in a spot we can already infer, but others may have
>>> different opinions.
>>>
>>> I may be missing some details, so please chime in to add them.
>>> Otherwise we may want to choose the best option and agree on a sample
>>> form of description. (Once we reach that point, it may be OK

RE: Adaptive Query Execution performance results in 3TB TPC-DS

2020-02-13 Thread Jia, Ke A
Hi Amogh,
Thanks for your interest in the AQE work.

> Were any table stats available for TPC-DS during the runs?
   We used the default configurations and didn't set special configurations
(such as CBO) to collect table stats, with AQE both enabled and disabled. AQE
mainly relies on runtime statistics for further optimization, not table stats,
so the effect of table stats on these benchmark tests is likely small.
Thanks.
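For context, "table stats" refers to catalog statistics used by the cost-based optimizer (CBO) at planning time. A minimal sketch of collecting them and enabling CBO is below; this is illustrative only, was not part of the benchmark setup described above, and the table and column names are just TPC-DS examples.

  // Minimal sketch (not part of the benchmark setup above): collect catalog
  // statistics and enable cost-based optimization. AQE instead relies on
  // runtime shuffle statistics, so it works without this step.
  spark.conf.set("spark.sql.cbo.enabled", "true")
  spark.sql("ANALYZE TABLE tpcds.store_sales COMPUTE STATISTICS")
  spark.sql("ANALYZE TABLE tpcds.store_sales COMPUTE STATISTICS FOR COLUMNS ss_quantity, ss_net_paid")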

Regards,
Jia Ke

From: Amogh Margoor 
Sent: Friday, February 14, 2020 5:02 AM
To: Wenchen Fan 
Cc: Jia, Ke A ; dev@spark.apache.org
Subject: Re: Adaptive Query Execution performance results in 3TB TPC-DS

Thanks, Jia Ke, for the numbers - they look promising.
Were any table stats available for TPC-DS during the runs?

On Thu, Feb 13, 2020 at 4:07 AM Wenchen Fan wrote:
Thanks for providing the benchmark numbers! The result is very promising and 
I'm looking forward to seeing more feedback from real-world workloads.

On Wed, Feb 12, 2020 at 3:43 PM Jia, Ke A wrote:
Hi all,
We have completed the Spark 3.0 Adaptive Query Execution (AQE) performance tests
in 3TB TPC-DS on a 5-node Cascade Lake cluster. With AQE, 2 queries show more
than a 1.5x speedup and 37 queries show more than a 1.1x speedup. No query shows
a significant performance degradation. The detailed performance results and key
configurations are shown here. Based on the performance results, we recommend
users turn on AQE in Spark 3.0. If you encounter any bugs or see room for
improvement when enabling AQE, please file the related JIRAs. Thanks.

Regards,
Jia Ke



Re: Request to document the direct relationship between other configurations

2020-02-13 Thread Jungtaek Lim
OK, I agree this is moving forward, as we have decided on the final goal and it
seems someone has started working on it. In the meantime, we agree about
documenting the direct relationship, and which style to use remains "open".

Thanks again for initiating the discussion thread - it led to the follow-up
thread about the final goal.

On Fri, Feb 14, 2020 at 1:43 PM Hyukjin Kwon  wrote:

> It's okay to just follow one prevailing style. The main point I would like
> to make is that we should *document* the direct relationship between
> configurations.
> For this case specifically, I don't think there is much point in deciding
> on one hard requirement to follow for the mid-term. We have the final
> goal to reach.
>
> There are always many cases that require a committer's judgement. It is
> impossible to put every single item into a guideline in practice.
> So we usually trust their judgement unless there's an explicit
> objection. Here, my explicit objection is about the lack of documentation
> for the direct relationship between configurations. It seems people think
> we should document this in general as well.
>
> In practice, I don't particularly mind which style is used. I would
> guess the style isn't an issue for many other committers in general.
> If that is the case, let's open a separate thread to discuss and confirm
> one style - this wasn't the main issue I wanted to address in this thread.
>
> Shall we conclude this thread by deciding to document the direct
> relationship between configurations, preferably in one prevailing style?
>
>
> On Fri, Feb 14, 2020 at 11:36 AM, Jungtaek Lim wrote:
>
>> Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows a
>> mix of 1 and 2-1, and 1 is even broken there.
>>
>> If we were only discussing a one-time decision, it might be OK not to be
>> strict about consistency, though that's not ideal. The thing is that these
>> kinds of "preferences" create confusion during the review phase: reviewers
>> provide different comments and try to "educate" contributors according to
>> their own preferences. Expectations in such cases depend heavily on who
>> the reviewers of the PR are - there's no value in that kind of education.
>>
>> The codebase is the reference for the implicit rules/policies that apply
>> to all contributors, including newcomers. Let's just put our best effort
>> into keeping the codebase consistent. (We need consensus to do this
>> anyway.)
>>
>>
>> On Thu, Feb 13, 2020 at 12:44 PM Hyukjin Kwon 
>> wrote:
>>
>>> I think it’s just fine as long as we’re consistent with the instances
>>> having the description, for instance:
>>>
>>>   When true and ‘spark.xx.xx’ is enabled, …
>>>
>>> I think this is 2-2 in most cases so far. We can reference other
>>> configuration keys in another configuration's documentation by using
>>> ADAPTIVE_EXECUTION_ENABLED.key, for example. So we don't have a
>>> duplication problem in most cases.
>>>
>>> Being consistent with the current base is general guidance, if I am not
>>> mistaken. We have identified a problem and a good goal to reach.
>>> If we want to change, let's do it as our final goal. Otherwise, let's
>>> simplify it to reduce the overhead rather than having a policy
>>> specifically for the mid-term.
>>>
>>>
>>> On Thu, Feb 13, 2020 at 12:24 PM, Jungtaek Lim wrote:
>>>
 I tend to agree that there should be a time to make things consistent
 (and I'm very happy to see the new thread on discussion), and we may want to
 adopt some practice in the interim.

 But to me it is not clear what the practice in the interim should be. I
 pointed out the problems of the existing style, and if we all agree the
 problems are valid then we may need to fix them before we start using the
 style widely.

 For me the major question is "where" to put the information - there are at
 least two different approaches:

 1) Put it in the description of `.enable` and describe the range of
 impact, e.g. state in the description of "spark.A.enable" that it affects
 the configurations under "spark.A".
 (I don't think it's good to enumerate all of the affected configs;
 `spark.dynamicAllocation.enable` shows how fragile that is.)

 2) Put it in the description of every affected configuration and
 describe which configuration is the prerequisite for it to be effective,
 e.g. state on all configurations under `spark.dynamicAllocation` that
 `spark.dynamicAllocation.enable` must be enabled for them to take effect.

 (I intentionally skipped mentioning "cross reference". Let's be pragmatic.)

 2) also has two ways to describe it:

 2-1) Mention it simply - like "When dynamic allocation is enabled,",
 without pointing out the key to toggle. This hugely simplifies the
 description, though end users may have to guess to find the key to toggle.
 (It'll be intuitive once we structure the configurations.)

 2-2) Mention the key to toggle directly - li

Re: SparkGraph review process

2020-02-13 Thread kant kodali
Any update on this? Is Spark Graph going to make it into Spark or not?

On Mon, Oct 14, 2019 at 12:26 PM Holden Karau  wrote:

> Maybe let's ask the folks from Lightbend who helped with the previous
> Scala upgrade for their thoughts?
>
> On Mon, Oct 14, 2019 at 8:24 PM Xiao Li  wrote:
>
>>> 1. On the technical side, my main concern is the runtime dependency on
>>> org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
>>> came up with the solution of shading a few Scala libraries to avoid
>>> pollution. However, I'm not super confident that the approach is
>>> sustainable, for two reasons: a) there are no proper shading libraries
>>> for Scala, and b) we will have to wait for upgrades of those Scala
>>> libraries before we can upgrade Spark to use a newer Scala version. So it
>>> would be great if some Scala experts could help review the current
>>> implementation and assess the risk.
>>
>>
>> This concern is valid. I think we should start the vote to ensure the
>> whole community is aware of the risk and takes responsibility for
>> maintaining this in the long term.
>>
>> Cheers,
>>
>> Xiao
>>
>>
>> Xiangrui Meng  于2019年10月4日周五 下午12:27写道:
>>
>>> Hi all,
>>>
>>> I want to clarify my role first to avoid misunderstanding. I'm an
>>> individual contributor here. My work on the graph SPIP, as well as on
>>> other Spark features I contributed to, is not associated with my
>>> employer. It became quite challenging for me to keep track of the graph
>>> SPIP work due to having less available time at home.
>>>
>>> In retrospect, we should have involved more Spark devs and committers
>>> early on so there would be no single point of failure, i.e., me.
>>> Hopefully it is not too late to fix that. I summarize my thoughts here to
>>> help onboard other reviewers:
>>>
>>> 1. On the technical side, my main concern is the runtime dependency on
>>> org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
>>> came up with the solution of shading a few Scala libraries to avoid
>>> pollution. However, I'm not super confident that the approach is
>>> sustainable, for two reasons: a) there are no proper shading libraries
>>> for Scala, and b) we will have to wait for upgrades of those Scala
>>> libraries before we can upgrade Spark to use a newer Scala version. So it
>>> would be great if some Scala experts could help review the current
>>> implementation and assess the risk.
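For illustration, "shading" here means relocating a dependency's packages into a private namespace. The minimal sbt-assembly sketch below shows the idea; it is not Spark's actual build setup (okapi-shade is a pre-shaded artifact), and bytecode-level renaming like this does not rewrite Scala-specific metadata, which is exactly the gap described above.

  // Illustrative only (e.g. in build.sbt with the sbt-assembly plugin enabled):
  // relocate the okapi packages so they cannot clash with other versions of the
  // same libraries on the application classpath. Package names are examples.
  assembly / assemblyShadeRules := Seq(
    ShadeRule.rename("org.opencypher.okapi.**" -> "org.sparkproject.okapi.@1").inAll
  )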
>>>
>>> 2. Overloading helper methods. MLlib used to have several overloaded
>>> helper methods for each algorithm, which later became a major maintenance
>>> burden. Builders and setters/getters are more maintainable. I will comment
>>> again on the PR.
>>>
>>> 3. The proposed API partitions the graph into sub-graphs, as described in
>>> the property graph model. It is unclear to me how it would affect query
>>> performance, because it requires the SQL optimizer to correctly recognize
>>> data from the same source and make execution efficient.
>>>
>>> 4. The feature, although originally targeted for Spark 3.0, should not
>>> be a Spark 3.0 release blocker because it doesn't require breaking changes.
>>> If we miss the code freeze deadline, we can introduce a build flag to
>>> exclude the module from the official release/distribution, and then make it
>>> default once the module is ready.
>>>
>>> 5. If unfortunately we still don't see sufficient committer reviews, I
>>> think the best option would be submitting the work to Apache Incubator
>>> instead to unblock the work. But maybe it is too early to discuss this
>>> option.
>>>
>>> It would be great if other committers could offer help with the review!
>>> It would be really appreciated!
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Fri, Oct 4, 2019 at 1:32 AM Mats Rydberg 
>>> wrote:
>>>
 Hello dear Spark community

 We are the developers behind the SparkGraph SPIP, which is a project
 created out of our work on openCypher Morpheus (
 https://github.com/opencypher/morpheus). During this year we have
 collaborated mainly with Xiangrui Meng of Databricks to define and develop
 a new SparkGraph module based on our experience from working on Morpheus.
 Morpheus - formerly known as "Cypher for Apache Spark" - has been in
 development for over 3 years and matured in its API and implementation.

 The SPIP work has been on hold for some time now, as priorities at
 Databricks have changed, which has occupied Xiangrui's time (among other
 things). As you may know, the latest API PR (
 https://github.com/apache/spark/pull/24851) is blocking us from moving
 forward with the implementation.

 In an attempt not to lose track of this project, we are now reaching out to
 ask whether there are any Spark committers in the community who would be
 prepared to commit to helping us review and merge our code contributions to
 Apache Spark. We are not asking for lots of direct development support, as
 we believe we have the implementation more