> On Sept. 5, 2014, 5:59 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java, line 228
> > <https://reviews.apache.org/r/25394/diff/1/?file=680525#file680525line228>
> >
> >     What's the reason to remove this?
> 
> Chao Sun wrote:
>     This is an issue we encountered in HIVE-7870: with this line, context.fileSinkSet will contain multiple duplicated fileSinks, which may then generate duplicated Move/Merge tasks. It would be better to leave it to be solved in that JIRA. However, commenting out this line makes things easier, since in {{GenSparkUtils::processFileSink}} I don't need to consider those "fake" file sinks - they should not be in the {{opToTaskTable}}.
>     
>     I can also keep this line and change some other places. It's not a big issue.
Probably you can put comments or TODOs on this. Thanks for the explanation.

- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25394/#review52472
-----------------------------------------------------------


On Sept. 5, 2014, 6:18 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25394/
> -----------------------------------------------------------
> 
> (Updated Sept. 5, 2014, 6:18 p.m.)
> 
> 
> Review request for hive, Brock Noland and Xuefu Zhang.
> 
> 
> Bugs: HIVE-7503
>     https://issues.apache.org/jira/browse/HIVE-7503
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> For Hive's multi insert query (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), there may be an MR job for each insert. When we achieve this with Spark, it would be nice if all the inserts can happen concurrently.
> It seems that this functionality isn't available in Spark. To make things worse, the source of the insert may be re-computed unless it is staged. Even with staging, the inserts will happen sequentially, making performance suffer.
> This task is to find out what it takes in Spark to enable this without requiring staging of the source and sequential insertion. If this has to be solved in Hive, find out an optimal way to do it.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 9c808d4 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 5ddc16d 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 379a39c 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 864965e 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 76fc290 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 5fcaf64 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25394/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>
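[Editor's note: for readers unfamiliar with the feature under discussion, the description above refers to Hive's multi-insert syntax, where a single scan of a source table feeds several INSERT clauses. A minimal sketch (table and column names here are hypothetical, not from this patch):

```sql
-- One scan of src feeds two destination tables in a single statement;
-- on MapReduce each INSERT clause may become its own job, which is the
-- overhead this review's patch addresses for the Spark backend.
FROM src
INSERT OVERWRITE TABLE dest1 SELECT key, value WHERE key < 100
INSERT OVERWRITE TABLE dest2 SELECT key, value WHERE key >= 100;
```

]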