[ https://issues.apache.org/jira/browse/HIVE-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082045#comment-14082045 ]
Xuefu Zhang edited comment on HIVE-7503 at 8/5/14 9:15 PM:
-----------------------------------------------------------

As it's unlikely that SPARK-2688 will land in a short period of time, and since HIVE-7525 has shown that we can submit Spark jobs concurrently, I'd like to propose the following backup plan:

1. The multi-insert plan can be decomposed into 1 + N plans, where N is the number of inserts.
2. The 1 plan is the one generating the data source for all the inserts. This plan may be unnecessary if the source is a table, but in general the source comes from a job.
3. For each insert, there is one plan that performs the insert, so N inserts correspond to N plans. The input to these plans is the data generated by the 1 job.
4. We first run the 1 plan as a Spark job, which emits the data source.
5. Then we call checkpoint() on the RDD from #4.
6. Lastly, we launch N jobs concurrently, each taking the checkpointed RDD as input (see the sketch at the end of this message).

While not ideal and probably not the most efficient, this approach should perform better than running the 1 + N jobs sequentially. The idea came from [~sowen] and was verified by [~csun] via HIVE-7525. [~rxin], what are your thoughts on this approach? Do you have any other suggestions?


was (Author: xuefuz):
As it's unlikely that SPARK-2688 will land in a short period of time, and since HIVE-7525 has shown that we can submit Spark jobs concurrently, I'd like to propose the following backup plan:

1. The multi-insert plan can be decomposed into 1 + N plans, where N is the number of inserts.
2. The 1 plan is the one generating the data source for all the inserts. This plan may be unnecessary if the source is a table, but in general the source comes from a job.
3. For each insert, there is one plan that performs the insert, so N inserts correspond to N plans. The input to these plans is the data generated by the 1 job.
4. We first run the 1 plan as a Spark job, which emits the data source.
5. Then we cache the data source via RDD.cache().
6. Lastly, we launch N jobs concurrently, each with the above cached RDD as input.

While not ideal and probably not the most efficient, this approach should perform better than running the 1 + N jobs sequentially. The idea came from [~sowen] and was verified by [~csun] via HIVE-7525. [~rxin], what are your thoughts on this approach? Do you have any other suggestions?


> Support Hive's multi-table insert query with Spark
> --------------------------------------------------
>
>                 Key: HIVE-7503
>                 URL: https://issues.apache.org/jira/browse/HIVE-7503
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Chao
>
> For Hive's multi-insert query (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), there may be an MR job for each insert. When we achieve this with Spark, it would be nice if all the inserts could happen concurrently.
> It seems that this functionality isn't available in Spark. To make things worse, the source of the inserts may be recomputed unless it's staged, and even with staging the inserts will happen sequentially, making performance suffer.
> This task is to find out what it takes in Spark to enable this without requiring staging of the source or sequential insertion. If this has to be solved in Hive, find an optimal way to do it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
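
A minimal Scala sketch of the proposed 1 + N flow, to make steps 4 through 6 concrete. It assumes a plain SparkContext run locally; the input/output paths, the toUpperCase transformation, and the thread-per-insert submission are illustrative placeholders, not Hive's actual Spark integration code.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object MultiInsertSketch {
  def main(args: Array[String]): Unit = {
    // Local master only for the sketch; a real deployment gets it from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("multi-insert-sketch").setMaster("local[*]"))

    // Checkpoint files must live on storage visible to all executors (e.g. HDFS).
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // Step 4: the "1" plan, a single job that produces the shared data source.
    // The input path and transformation stand in for the real source plan.
    val source = sc.textFile("/tmp/multi-insert/source").map(_.toUpperCase)

    // Step 5: checkpoint the RDD. checkpoint() is lazy, so an action is needed
    // to run the job and actually write the checkpoint files. Persisting first
    // avoids recomputing the source when the checkpoint files are written.
    source.cache()
    source.checkpoint()
    source.count()

    // Step 6: the N insert plans, one Spark job each, submitted concurrently
    // from separate threads. Each job reads the checkpointed RDD instead of
    // recomputing the source lineage.
    val outputs = Seq("/tmp/multi-insert/out1", "/tmp/multi-insert/out2")
    val threads = outputs.map { path =>
      new Thread(new Runnable {
        override def run(): Unit = source.saveAsTextFile(path)
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())

    sc.stop()
  }
}
{code}

A single SparkContext accepts job submissions from multiple threads, which is what HIVE-7525 verified, so the N save jobs run concurrently, bounded only by the executors available to the application.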