[ https://issues.apache.org/jira/browse/HIVE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xuefu Zhang reassigned HIVE-7958:
---------------------------------

    Assignee: Xuefu Zhang

> SparkWork generated by SparkCompiler may require multiple Spark jobs to run
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-7958
>                 URL: https://issues.apache.org/jira/browse/HIVE-7958
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>            Priority: Critical
>              Labels: Spark-M1
>
> A SparkWork instance may currently contain disjoint work graphs. For instance, union_remove_1.q may generate a plan like this:
> {code}
> Reducer 2 <- Map 1
> Reducer 4 <- Map 3
> {code}
> The SparkPlan instance generated from this work graph contains two result RDDs. When such a plan is executed, we call .foreach() on the two RDDs sequentially, which results in two Spark jobs, one after the other.
> While this works functionally, performance will not be great because the Spark jobs run sequentially rather than concurrently.
> Another side effect is that the corresponding SparkPlan instance is over-complicated.
> There are two potential approaches:
> 1. Let SparkCompiler generate a SparkWork that can be executed in ONE Spark job only. In the above example, two Spark tasks should be generated.
> 2. Let SparkPlanGenerator generate multiple Spark plans and have SparkClient execute them concurrently.
> Approach #1 seems more reasonable and fits naturally into our architecture. Also, Hive's task execution framework already takes care of task concurrency.
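The sequential-job behavior described in the issue can be illustrated with the plain Spark Java API, outside of Hive. The following is a minimal, hypothetical sketch (the class name DisjointWorkGraphSketch and the branch1/branch2 variables are invented and this is not Hive's SparkPlan code): two independent RDD lineages stand in for the two disconnected branches of the plan, and each .foreach() action submits its own Spark job, so the second job cannot start until the first finishes.

{code}
// Hypothetical, self-contained sketch (not Hive code): two disjoint RDD lineages,
// analogous to the "Reducer 2 <- Map 1" and "Reducer 4 <- Map 3" branches above.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DisjointWorkGraphSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("disjoint-work-graph-sketch")
        .setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Two independent "result RDDs", like the two branches in the union_remove_1.q plan.
    JavaRDD<String> branch1 = sc.parallelize(Arrays.asList("a", "b")).map(s -> s + "-branch1");
    JavaRDD<String> branch2 = sc.parallelize(Arrays.asList("c", "d")).map(s -> s + "-branch2");

    // Each foreach() is an action: Spark submits one job per call, so the two jobs
    // run one after the other rather than concurrently.
    branch1.foreach(s -> System.out.println(s));
    branch2.foreach(s -> System.out.println(s));

    sc.stop();
  }
}
{code}

Under approach #1, each disconnected branch would instead become its own Spark task holding a single result RDD, so every Spark job stays self-contained and any concurrency between the two tasks is left to Hive's existing task execution framework.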