Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Xuefu Zhang Sun, 09 Nov 2014 07:06:37 -0800


> On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
> >  line 214
> > <https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214>
> >
> >     This assumes that the resulting SparkWorks will be linearly dependent on 
> > each other, which isn't true in general. Let's say there are two works (w1 
> > and w2), each having a map join operator. w1 and w2 are connected to w3 via 
> > HTS. w3 also contains a map join operator. The dependency in this scenario 
> > will be a graph rather than a linear chain.
> 
> Chao Sun wrote:
>     I was thinking, in this case, if there's no dependency between w1 and w2, 
> they can be put in the same SparkWork, right?
>     Otherwise, they will form a linear dependency too.
> 
> Xuefu Zhang wrote:
>     w1 and w2 are fine. They will be in the same SparkWork. This SparkWork 
> will depend on both the SparkWork generated at w1 and the SparkWork generated 
> at w2. This dependency is not linear.
>     
>     To give more detail: for each work that has a map join op, we need to 
> create a SparkWork to handle its small tables. So both w1 and w2 will need 
> to create such a SparkWork. While w1 and w2 are in the same SparkWork, this 
> SparkWork depends on the two SparkWorks created.
> 
> Chao Sun wrote:
>     I'm not getting it. Why is "This dependency is not linear"? Can you give a 
> counterexample?
>     Suppose w1(MJ_1), w2(MJ_2), and w3(MJ_3) are like the following:
>     
>          HTS_1   HTS_2     HTS_3    HTS_4
>            \      /           \     /
>             \    /             \   /
>               MJ_1              MJ_2
>                |                 |
>                |                 |
>               HTS_5            HTS_6
>                   \            /
>                    \          /
>                     \        /
>                      \      /
>                       \    /
>                         MJ_3
>                         
>     Then, what I'm doing is to put HTS_1, HTS_2, HTS_3, and HTS_4 in the same 
> SparkWork, say SW_1.
>     Then MJ_1, MJ_2, HTS_5, and HTS_6 will be in another SparkWork, SW_2, and 
> MJ_3 in another SparkWork, SW_3.
>     SW_1 -> SW_2 -> SW_3.


I don't think we should put (HTS1, HTS2) and (HTS3, HTS4) in the same SparkWork. 
They belong to different MJs, which handle different sets of small tables. 
Putting them together would make HashTableSinkOperator and HashTableLoader more 
complicated.

In terms of dependencies, MJ1 doesn't need to wait for HTS3/HTS4 in order to 
run, and vice versa: MJ2 doesn't need to wait for HTS1/HTS2.
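
To make the shape of the dependency concrete, here is a rough sketch in Java. 
All class, method, and SparkWork names in it are made up for illustration; 
none of them are Hive APIs, and this is not the pseudocode from the JIRA. It 
assumes, as said earlier in this thread, that MJ_1 and MJ_2 (with HTS_5 and 
HTS_6) stay in one SparkWork, while each MJ gets its own SparkWork for its 
small tables. It only records which SparkWork waits for which, and checks 
whether the result is a simple chain.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustration only: hypothetical names, not Hive classes. */
public class SparkWorkDependencySketch {

    // For each SparkWork (by name), the SparkWorks it has to wait for.
    private final Map<String, Set<String>> parents = new LinkedHashMap<>();

    public void addDependency(String child, String parent) {
        parents.computeIfAbsent(parent, k -> new LinkedHashSet<>());
        parents.computeIfAbsent(child, k -> new LinkedHashSet<>()).add(parent);
    }

    public Set<String> parentsOf(String work) {
        return parents.getOrDefault(work, new LinkedHashSet<>());
    }

    // "Linear" here means a simple chain: no SparkWork has more than one
    // parent or more than one child.
    public boolean isLinear() {
        Map<String, Integer> childCount = new HashMap<>();
        for (Set<String> ps : parents.values()) {
            if (ps.size() > 1) {
                return false;
            }
            for (String p : ps) {
                if (childCount.merge(p, 1, Integer::sum) > 1) {
                    return false;
                }
            }
        }
        return true;
    }

    // One small-table SparkWork per map join, as suggested in this reply.
    public static SparkWorkDependencySketch perMapJoinGrouping() {
        SparkWorkDependencySketch g = new SparkWorkDependencySketch();
        g.addDependency("SW_MJ1_MJ2", "SW_HTS1_HTS2"); // small tables of MJ_1
        g.addDependency("SW_MJ1_MJ2", "SW_HTS3_HTS4"); // small tables of MJ_2
        g.addDependency("SW_MJ3", "SW_MJ1_MJ2");       // HTS_5/HTS_6 feed MJ_3
        return g;
    }

    // All four HTS in one SparkWork, as in the quoted SW_1 -> SW_2 -> SW_3.
    public static SparkWorkDependencySketch singleHtsWorkGrouping() {
        SparkWorkDependencySketch g = new SparkWorkDependencySketch();
        g.addDependency("SW_2", "SW_1");
        g.addDependency("SW_3", "SW_2");
        return g;
    }

    public static void main(String[] args) {
        System.out.println("per-MJ grouping linear? "
                + perMapJoinGrouping().isLinear());
        System.out.println("single HTS work linear? "
                + singleHtsWorkGrouping().isLinear());
    }
}
```

With the per-MJ grouping, the SparkWork holding MJ_1 and MJ_2 has two parents, 
so the resolver can't assume a single chain of SparkWorks.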

Please refer to the pseudocode posted in the JIRA for implementation ideas. Thanks.


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
-----------------------------------------------------------


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for Spark:
> https://issues.apache.org/jira/browse/HIVE-7613
> It can use the baseline patch for map-join:
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>
