-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25394/#review54004
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
<https://reviews.apache.org/r/25394/#comment93870>

    I was thinking that you have all the paths at hand, and now you just keep
    moving all the branches up together, checking whether the LCA is hit. I
    didn't realize we do this while we are still traversing the tree. I have
    to admit that I don't quite get the whole logic here. Does the LCA change
    before all TSs are visited?

ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
<https://reviews.apache.org/r/25394/#comment93876>

    Correct me if I'm wrong: for the whole graph, we find at most one LCA to
    split the plan, right? Also, the LCA can in no way be a FORWARD, right?
    But there can be multiple FORWARDs, which can have a common ancestor,
    which might be a point to split. Again, I don't quite understand how the
    LCA is identified while we are still visiting the tree. But I'm sure that
    we don't want to create more Spark jobs than needed. If we don't do
    better than MR when we could, the value of the project would be greatly
    compromised.

ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
<https://reviews.apache.org/r/25394/#comment93869>

    Okay. Fair enough.


- Xuefu Zhang


On Sept. 18, 2014, 6:38 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25394/
> -----------------------------------------------------------
> 
> (Updated Sept. 18, 2014, 6:38 p.m.)
> 
> 
> Review request for hive, Brock Noland and Xuefu Zhang.
> 
> 
> Bugs: HIVE-7503
>     https://issues.apache.org/jira/browse/HIVE-7503
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> For Hive's multi-insert query
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), there
> may be an MR job for each insert. When we achieve this with Spark, it would
> be nice if all the inserts could happen concurrently.
> It seems that this functionality isn't available in Spark. To make things
> worse, the source of the insert may be re-computed unless it's staged. Even
> with staging, the inserts will happen sequentially, making performance
> suffer.
> 
> This task is to find out what it takes in Spark to enable this without
> requiring staging of the source and sequential insertion. If this has to be
> solved in Hive, find out an optimal way to do it.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 4211a0703f5b6bfd8a628b13864fac75ef4977cf
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 695d8b90cb1989805a7ff4e39a9635bbcea9c66c
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 864965e03a3f9d665e21e1c1b10b19dc286b842f
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 76fc290f00430dbc34dbbc1a0cef0d0eb59e6029
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java PRE-CREATION
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java PRE-CREATION
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 5fcaf643a0e90fc4acc21187f6d78cefdb1b691a
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/25394/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>
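P.S. For anyone following along: the first comment above talks about moving
all branches up together until the LCA (lowest common ancestor) is hit. A
rough illustration of one way to find the LCA of several leaves, via
ancestor-chain intersection, is sketched below. This is a hypothetical,
simplified stand-in (plain strings and a parent map, made-up node names),
not the actual SparkTableScanProcessor logic:

```java
import java.util.*;

// Hypothetical sketch, not Hive code: find the lowest common ancestor (LCA)
// of several leaf nodes in a tree represented as a child -> parent map.
public class LcaSketch {

    // parent maps each node to its parent; the root has no entry (null).
    static String lca(Map<String, String> parent, List<String> nodes) {
        // Collect the ancestor chain (including the node itself) of the
        // first node.
        Set<String> common = new HashSet<>();
        for (String n = nodes.get(0); n != null; n = parent.get(n)) {
            common.add(n);
        }
        // Intersect with every other node's ancestor chain; what survives
        // are the ancestors shared by all nodes.
        for (int i = 1; i < nodes.size(); i++) {
            Set<String> chain = new HashSet<>();
            for (String n = nodes.get(i); n != null; n = parent.get(n)) {
                chain.add(n);
            }
            common.retainAll(chain);
        }
        // Walking up from any node, the first surviving ancestor we hit is
        // the lowest (deepest) common one.
        for (String n = nodes.get(0); n != null; n = parent.get(n)) {
            if (common.contains(n)) {
                return n;
            }
        }
        return null; // nodes share no ancestor (disconnected forest)
    }

    public static void main(String[] args) {
        //        root
        //       /    \
        //      a      b
        //     / \      \
        //   ts1 ts2    ts3
        Map<String, String> parent = new HashMap<>();
        parent.put("a", "root");
        parent.put("b", "root");
        parent.put("ts1", "a");
        parent.put("ts2", "a");
        parent.put("ts3", "b");
        System.out.println(lca(parent, Arrays.asList("ts1", "ts2")));
        System.out.println(lca(parent, Arrays.asList("ts1", "ts2", "ts3")));
    }
}
```

With all three leaves included the walk converges only at the root, which is
the intuition behind "at most one LCA to split the plan" for the whole graph.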