[ https://issues.apache.org/jira/browse/HIVE-8841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209856#comment-14209856 ]
Xuefu Zhang commented on HIVE-8841:
-----------------------------------

Patch looks good. +1

{quote}
do you know how I can verify this?
{quote}

Currently we can only verify this manually. For instance, if a MapWork is split and thus cloned, we should read from the source only once. Similarly, for a ReduceWork, the shuffle before the ReduceWork should only happen once. I'm going to create a JIRA to visualize the Spark plan so that we know where caching is turned on.

> Make RDD caching work for multi-insert after HIVE-8793 when map join is
> involved [Spark Branch]
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8841
>                 URL: https://issues.apache.org/jira/browse/HIVE-8841
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>         Attachments: HIVE-8841.1-spark.patch
>
>
> Splitting SparkWork now happens before MapJoinResolver. As MapJoinResolver may
> further spin off a dependent SparkWork for small tables of a join, we need
> to make Spark RDD caching continue to work even across SparkWorks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
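As a side note for readers, the read-once property discussed in the comment can be sketched with a small, self-contained Python model. This is purely illustrative and is not Hive or Spark code: `CachedSource`, `scan_table`, and the two "branches" are hypothetical names standing in for a shared source, a table scan, and the outputs of a multi-insert whose work has been split and cloned. With caching, both branches reuse one scan instead of each re-reading the source.

```python
# Illustrative sketch only -- not Hive/Spark code. It models the idea that
# when a work unit is split for multi-insert, the shared source should be
# read once and the cached result reused by each branch, analogous to
# Spark RDD caching.

class CachedSource:
    """Performs the underlying read at most once, then serves cached rows."""

    def __init__(self, read_fn):
        self._read_fn = read_fn   # function performing the expensive read
        self._cache = None        # cached rows after the first read
        self.read_count = 0       # how many times the source was really read

    def rows(self):
        if self._cache is None:
            self.read_count += 1
            self._cache = self._read_fn()
        return self._cache


def scan_table():
    # Stand-in for a table scan (the "source" a cloned MapWork would re-read).
    return [1, 2, 3, 4]


source = CachedSource(scan_table)

# Two branches of a multi-insert consume the same source.
branch_a = [x * 2 for x in source.rows()]
branch_b = [x + 10 for x in source.rows()]

print(source.read_count)  # the source is scanned only once
```

Without the cache (i.e., calling `scan_table()` directly in each branch), `read_count` would be 2, which is exactly the duplicated work the patch avoids across split SparkWorks.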