[ 
https://issues.apache.org/jira/browse/HIVE-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082045#comment-14082045
 ] 

Xuefu Zhang edited comment on HIVE-7503 at 8/5/14 9:15 PM:
-----------------------------------------------------------

As it's unlikely that SPARK-2688 will be resolved in the short term, and
since HIVE-7525 has shown that we can submit Spark jobs concurrently, I'd
like to propose the following backup plan:

1. The multi-insert plan can be decomposed into 1 + N plans, where N is the
number of inserts.
2. The "1" plan is the one that generates the data source shared by all the
inserts. This plan may not be necessary if the source is a table, but in
general the source comes from a job.
3. For each insert, there is one plan that inserts the data; thus, N inserts
correspond to N plans. The input to these plans is the data generated by the
"1" job.
4. We first run the "1" plan as a Spark job, which emits the data source.
5. Then, we call checkpoint() on the RDD from #4.
6. Lastly, we launch the N jobs concurrently, each taking the checkpointed
RDD as input (see the sketch below).
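A minimal sketch of steps 4-6 using the Spark RDD API in Scala. This is not
the actual Hive code; the paths, transform, and predicates below are
hypothetical placeholders:

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

object MultiInsertSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-insert-sketch"))
    // checkpoint() requires a checkpoint directory (HDFS in a real cluster).
    sc.setCheckpointDir("/tmp/hive-spark-checkpoints")

    // Step 4: the "1" job that produces the common data source.
    val source = sc.textFile("/path/to/src").map(transform)

    // Step 5: mark the RDD for checkpointing; an action is needed to
    // actually materialize it.
    source.checkpoint()
    source.count()

    // Step 6: launch the N insert jobs concurrently. SparkContext is
    // thread-safe, so separate threads can submit jobs in parallel.
    val inserts = Seq(
      Future { source.filter(predicateA).saveAsTextFile("/path/to/destA") },
      Future { source.filter(predicateB).saveAsTextFile("/path/to/destB") }
    )
    inserts.foreach(f => Await.ready(f, Duration.Inf))
    sc.stop()
  }

  // Hypothetical stand-ins for the source transformation and the
  // per-insert plans.
  def transform(line: String): String = line
  def predicateA(r: String): Boolean = r.length < 10
  def predicateB(r: String): Boolean = r.length >= 10
}
{code}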

While not ideal and probably not the most efficient, this approach should
perform better than running the 1 + N jobs sequentially.

The idea came from [~sowen] and was verified by [~csun] via HIVE-7525.

[~rxin], what are your thoughts on this approach? Do you have any other
suggestions?



was (Author: xuefuz):
As it's unlikely that SPARK-2688 will be resolved in the short term, and
since HIVE-7525 has shown that we can submit Spark jobs concurrently, I'd
like to propose the following backup plan:

1. The multi-insert plan can be decomposed into 1 + N plans, where N is the
number of inserts.
2. The "1" plan is the one that generates the data source shared by all the
inserts. This plan may not be necessary if the source is a table, but in
general the source comes from a job.
3. For each insert, there is one plan that inserts the data; thus, N inserts
correspond to N plans. The input to these plans is the data generated by the
"1" job.
4. We first run the "1" plan as a Spark job, which emits the data source.
5. Then, we cache the data source via RDD.cache().
6. Lastly, we launch the N jobs concurrently, each taking the cached RDD as
input.
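A minimal sketch of this earlier cache()-based variant of step 5, assuming
the same hypothetical `source` RDD as in the sketch above. Unlike
checkpoint(), cache() keeps the data in executor storage and does not
truncate the RDD's lineage, so evicted partitions may be recomputed:

{code}
// Earlier variant of step 5: cache instead of checkpoint.
source.cache()
source.count() // forces materialization before launching the N concurrent jobs
{code}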

While not ideal and probably not the most efficient, this approach should
perform better than running the 1 + N jobs sequentially.

The idea came from [~sowen] and was verified by [~csun] via HIVE-7525.

[~rxin], what are your thoughts on this approach? Do you have any other
suggestions?


> Support Hive's multi-table insert query with Spark
> --------------------------------------------------
>
>                 Key: HIVE-7503
>                 URL: https://issues.apache.org/jira/browse/HIVE-7503
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Chao
>
> For Hive's multi-insert query 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), there 
> may be an MR job for each insert. When we achieve this with Spark, it would 
> be nice if all the inserts could happen concurrently.
> It seems that this functionality isn't available in Spark. To make things 
> worse, the source of the insert may be re-computed unless it's staged. Even 
> with staging, the inserts will happen sequentially, hurting performance.
> This task is to find out what it takes in Spark to enable this without 
> requiring staging of the source or sequential insertion. If this has to be 
> solved in Hive, find the optimal way to do it.
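
To illustrate the recomputation issue described above, a minimal Scala
sketch; the SparkContext `sc`, the paths, and `expensiveTransform` are
hypothetical assumptions. Without cache() or checkpoint(), each action
re-runs the full lineage of the source RDD:

{code}
// Assumes an existing SparkContext `sc` and a costly function
// `expensiveTransform: String => String` (both hypothetical).
val source = sc.textFile("/path/to/src").map(expensiveTransform)

// Each action triggers a separate, sequential job:
source.saveAsTextFile("/path/to/dest1") // job 1 computes `source`
source.saveAsTextFile("/path/to/dest2") // job 2 recomputes `source`
{code}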



