[ 
https://issues.apache.org/jira/browse/SPARK-17375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-17375.
-------------------------------
    Resolution: Duplicate

Marking this as duplicate of SPARK-17626

> Star Join Optimization
> ----------------------
>
>                 Key: SPARK-17375
>                 URL: https://issues.apache.org/jira/browse/SPARK-17375
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Yan
>
> The star schema is the simplest style of data mart schema and is the approach 
> often seen in BI/Decision Support systems. Star Join is a popular SQL query 
> pattern that joins one or (a few) fact tables with a few dimension tables in 
> star schemas. Star Join Query Optimizations aim to optimize the performance 
> and use of resource for the star joins.
> Currently the existing Spark SQL optimization works on broadcasting the 
> usually small (after filtering and projection) dimension tables to avoid 
> costly shuffling of fact table and the "reduce" operations based on the join 
> keys.
> This improvement proposal tries to further improve the broadcast star joins 
> in the two areas:
> 1) avoid materialization of the intermediate rows that otherwise could 
> eventually not make to the final result row set after further joined with 
> other dimensions that are more restricting;
> 2) avoid the performance variations among different join orders. This could 
> also have been largely achieved by cost analysis and heuristics and selecting 
> a reasonably optimal join order. But we are here trying to achieve similar 
> improvement without relying on such info.
> A preliminary test against a small TPCDS 1GB data set indicates between 
> 5%-40% improvement (with codegen disabled on both tests) vs. the multiple 
> broadcast joins on one Query (Q27) that inner joins 4 dimension table with 
> one fact table. The large variation (5%-40%) is due to  the different join 
> ordering of the 4 broadcast joins. Tests using larger data sets and other 
> TPCDS queries are yet to be performed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to