[ 
https://issues.apache.org/jira/browse/SPARK-17791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616590#comment-15616590
 ] 

Ron Hu commented on SPARK-17791:
--------------------------------

This JIRA is indeed complementary to Cost-Based Optimizer (or CBO) project.  
But we need to be more careful, and as a result we should do this together with 
CBO.  Let me explain it below.

Previously I commented in SPARK-17626 , we need detailed statistics such as 
number of distinct values for a join column and number of records of a table in 
order to decide fact tables and dimension tables.  Today we are collecting 
statistics to reliably predict fact tables and dimension tables in CBO project. 
 In addition, we are estimating selectivity for every relational algebra 
operator in CBO so that we can give reliable plan cardinality.

Without CBO's support, the author currently is using SQL hint to provide 
predicate selectivity.  Note that SQL hint does not work well as it is not 
automated and it just shows the current weakness of Spark SQL optimizer.  The 
author also uses initial table size to predict fact/dimension tables.  As we 
know, after applying predicate, the relevant records to participate a join may 
be a small subset of the initial table.  Hence initial table size is not a 
reliable way to decide a star schema.  
 
While this PR can show promising performance gain on most tpc-ds benchmark 
queries as tpc-ds has a well-know star schema, but can this approach still work 
in real world applications which do not clearly define a star schema? 
 
Therefore, I suggest that [SPARK-17791] and [SPARK-17626] should be tightly 
integrated with CBO project.  We are set to release CBO in Spark 2.2.  With the 
progress made so far, we can begin integrating star join reordering in December.

The same comment also applies to [SPARK-17626].

My two cents.  Thanks.

> Join reordering using star schema detection
> -------------------------------------------
>
>                 Key: SPARK-17791
>                 URL: https://issues.apache.org/jira/browse/SPARK-17791
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Ioana Delaney
>            Priority: Critical
>         Attachments: StarJoinReordering1005.doc
>
>
> This JIRA is a sub-task of SPARK-17626.
> The objective is to provide a consistent performance improvement for star 
> schema queries. Star schema consists of one or more fact tables referencing a 
> number of dimension tables. In general, queries against star schema are 
> expected to run fast  because of the established RI constraints among the 
> tables. This design proposes a join reordering based on natural, generally 
> accepted heuristics for star schema queries:
> * Finds the star join with the largest fact table and places it on the 
> driving arm of the left-deep join. This plan avoids large tables on the 
> inner, and thus favors hash joins. 
> * Applies the most selective dimensions early in the plan to reduce the 
> amount of data flow.
> The design description is included in the below attached document.
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to