[ https://issues.apache.org/jira/browse/HIVE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang reassigned HIVE-7958:
---------------------------------

    Assignee: Xuefu Zhang

> SparkWork generated by SparkCompiler may require multiple Spark jobs to run
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-7958
>                 URL: https://issues.apache.org/jira/browse/HIVE-7958
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>            Priority: Critical
>              Labels: Spark-M1
>
> A SparkWork instance currently may contain disjoint work graphs. For 
> instance, union_remove_1.q may generate a plan like this:
> {code}
> Reduce 2 <- Map 1
> Reduce 4 <- Map 3
> {code}
> The SparkPlan instance generated from this work graph contains two result 
> RDDs. When such a plan is executed, we call .foreach() on the two RDDs 
> sequentially, which results in two Spark jobs, one after the other.
> While this works functionally, the performance will not be great as the Spark 
> jobs are run sequentially rather than concurrently.
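> As a rough illustration (a standalone Spark sketch with made-up data and class 
> name, not Hive code), two independent RDD lineages each need their own action, 
> and each action submits a separate job:
> {code}
> import java.util.Arrays;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
>
> public class SequentialForeachDemo {
>   public static void main(String[] args) {
>     SparkConf conf = new SparkConf()
>         .setAppName("SequentialForeachDemo").setMaster("local[*]");
>     JavaSparkContext sc = new JavaSparkContext(conf);
>
>     // Two disjoint lineages, analogous to "Reduce 2 <- Map 1" and "Reduce 4 <- Map 3".
>     JavaRDD<Integer> result1 = sc.parallelize(Arrays.asList(1, 2, 3)).map(x -> x * 2);
>     JavaRDD<Integer> result2 = sc.parallelize(Arrays.asList(4, 5, 6)).map(x -> x + 1);
>
>     // foreach() is an action: each call submits its own Spark job, and the
>     // second job is not submitted until the first one has finished.
>     result1.foreach(x -> { });   // Spark job #1
>     result2.foreach(x -> { });   // Spark job #2
>
>     sc.close();
>   }
> }
> {code}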
> Another side effect of this is that the corresponding SparkPlan instance is 
> over-complicated.
> There are two potential approaches:
> 1. Let SparkCompiler generate only SparkWork that can be executed in ONE Spark 
> job. In the above example, two Spark tasks should be generated.
> 2. Let SparkPlanGenerator generate multiple Spark plans and have SparkClient 
> execute them concurrently.
> Approach #1 seems more reasonable and fits our architecture more naturally. 
> Also, Hive's task execution framework already takes care of task 
> concurrency.
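> A minimal sketch of how approach #1 could split a disjoint work graph into 
> connected components, one per task (the class name and graph representation 
> are hypothetical stand-ins, not Hive's actual SparkWork API):
> {code}
> import java.util.*;
>
> // Hypothetical sketch: split a work graph with multiple connected components
> // into one group of vertices per component, each of which would become its
> // own SparkWork/SparkTask and hence run as exactly one Spark job.
> public class DisjointWorkSplitter {
>
>   // Returns the connected components of an undirected view of the work graph.
>   static List<Set<String>> connectedComponents(Map<String, Set<String>> edges) {
>     Set<String> seen = new HashSet<>();
>     List<Set<String>> components = new ArrayList<>();
>     for (String start : edges.keySet()) {
>       if (seen.contains(start)) continue;
>       Set<String> component = new HashSet<>();
>       Deque<String> stack = new ArrayDeque<>();
>       stack.push(start);
>       while (!stack.isEmpty()) {
>         String node = stack.pop();
>         if (!seen.add(node)) continue;
>         component.add(node);
>         for (String next : edges.getOrDefault(node, Collections.emptySet())) {
>           stack.push(next);
>         }
>       }
>       components.add(component);
>     }
>     return components;
>   }
>
>   public static void main(String[] args) {
>     // Undirected adjacency for the union_remove_1.q example:
>     // "Reduce 2 <- Map 1" and "Reduce 4 <- Map 3" form two disjoint graphs.
>     Map<String, Set<String>> edges = new HashMap<>();
>     edges.put("Map 1", new HashSet<>(Arrays.asList("Reduce 2")));
>     edges.put("Reduce 2", new HashSet<>(Arrays.asList("Map 1")));
>     edges.put("Map 3", new HashSet<>(Arrays.asList("Reduce 4")));
>     edges.put("Reduce 4", new HashSet<>(Arrays.asList("Map 3")));
>
>     // Prints the two components, e.g. [[Map 1, Reduce 2], [Reduce 4, Map 3]].
>     System.out.println(connectedComponents(edges));
>   }
> }
> {code}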



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
