[ https://issues.apache.org/jira/browse/HIVE-8841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209856#comment-14209856 ]
Xuefu Zhang commented on HIVE-8841:
-----------------------------------

Patch looks good. +1

{quote}
do you know how I can verify this?
{quote}

Currently we can only verify this manually. For instance, if a MapWork is split and thus cloned, we should read from the source only once. Similarly, for a ReduceWork, the shuffle before the ReduceWork should only happen once. I'm going to create a JIRA to visualize the Spark plan so that we know where caching is turned on.

> Make RDD caching work for multi-insert after HIVE-8793 when map join is
> involved [Spark Branch]
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8841
>                 URL: https://issues.apache.org/jira/browse/HIVE-8841
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>         Attachments: HIVE-8841.1-spark.patch
>
>
> Splitting SparkWork now happens before MapJoinResolver. As MapJoinResolver may
> further spin off a dependent SparkWork for small tables of a join, we need
> to make Spark RDD caching continue to work even across SparkWorks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
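As a side note for readers, the read-once property discussed in the comment can be sketched with a small, self-contained Python model. This is purely illustrative and is not Hive or Spark code: `CachedSource`, `scan_table`, and the two "branches" are hypothetical names standing in for a shared source, a table scan, and the outputs of a multi-insert whose work has been split and cloned. With caching, both branches reuse one scan instead of each re-reading the source.

```python
# Illustrative sketch only -- not Hive/Spark code. It models the idea that
# when a work unit is split for multi-insert, the shared source should be
# read once and the cached result reused by each branch, analogous to
# Spark RDD caching.

class CachedSource:
    """Performs the underlying read at most once, then serves cached rows."""

    def __init__(self, read_fn):
        self._read_fn = read_fn   # function performing the expensive read
        self._cache = None        # cached rows after the first read
        self.read_count = 0       # how many times the source was really read

    def rows(self):
        if self._cache is None:
            self.read_count += 1
            self._cache = self._read_fn()
        return self._cache


def scan_table():
    # Stand-in for a table scan (the "source" a cloned MapWork would re-read).
    return [1, 2, 3, 4]


source = CachedSource(scan_table)

# Two branches of a multi-insert consume the same source.
branch_a = [x * 2 for x in source.rows()]
branch_b = [x + 10 for x in source.rows()]

print(source.read_count)  # the source is scanned only once
```

Without the cache (i.e., calling `scan_table()` directly in each branch), `read_count` would be 2, which is exactly the duplicated work the patch avoids across split SparkWorks.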