[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235085#comment-16235085 ]

Xuefu Zhang commented on HIVE-17486:
------------------------------------

Hi [~kellyzly], I think your observation is correct. Spark has certain 
limitations. In fact, Tez's notion of edges doesn't even apply to Spark, 
which uses the RDD model. Internally, Hive translates the DAG into RDD 
operations (transformations and actions). In the example of (Map1->Reducer3, 
Map1->Reducer2), Hive on Spark actually produces a plan like (map12 -> reduce2, 
map13 -> reduce3) with map12 = map13. This way, there will be two Spark jobs. 
In the second job, the cached result of the shared map work is used instead of 
loading the data again. BTW, this is a multi-insert example.
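To make that two-job shape concrete, here is a minimal sketch using plain 
Spark RDDs (the names map12/reduce2 and the warehouse paths are made up for 
illustration, not Hive on Spark internals): the shared parent RDD is cached, 
so the second job's action reuses it instead of rescanning the input.

{code}
// A minimal sketch, assuming plain Spark RDDs; "map12"/"map13" and the
// paths are illustrative names, not Hive on Spark internals.
import org.apache.spark.{SparkConf, SparkContext}

object MultiInsertSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-insert-sketch"))

    // The shared parent work (map12 = map13), cached so the second job
    // reuses its result instead of reloading the input.
    val map12 = sc.textFile("/warehouse/src").map(_.split('\t')).cache()

    // Job 1: map12 -> reduce2.
    map12.map(cols => (cols(0), 1L)).reduceByKey(_ + _)
      .saveAsTextFile("/warehouse/dest2")

    // Job 2: map13 -> reduce3, served from the cached parent RDD.
    map12.map(cols => (cols(1), 1L)).reduceByKey(_ + _)
      .saveAsTextFile("/warehouse/dest3")

    sc.stop()
  }
}
{code}

With cache(), the second saveAsTextFile reads the parent's partitions from 
memory (or disk, depending on the storage level) rather than recomputing them.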

Multiple edges between two vertices are even harder to support. You might be 
able to turn on this optimization for Spark, but Spark might not be able to run 
the resulting plan. I'm not sure there is any case where this optimization 
would help Spark. My gut feeling is that it would need to be combined with 
Spark RDD caching or Hive's materialized views.

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>            Priority: Major
>         Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> HIVE-16602 implemented shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. The optimization is carried out at the 
> physical level. Hive on Spark caches the result of a spark work if that spark 
> work is used by more than 1 child spark work. After SharedWorkOptimizer is 
> enabled in the HoS physical plan, identical table scans are merged into 1 
> table scan, whose result is then used by more than 1 child spark work. Thus, 
> thanks to the caching mechanism, we need not repeat the same computation.
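To illustrate the quoted description, here is a minimal before/after sketch in 
plain Spark RDD terms (the table path and the filter predicates are 
hypothetical): the two identical scans collapse into one cached scan that 
feeds both child works.

{code}
// A minimal before/after sketch, again with plain Spark RDDs; the table
// path and the filter predicates are made-up examples.
import org.apache.spark.{SparkConf, SparkContext}

object SharedScanSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-scan-sketch"))

    // Before SharedWorkOptimizer, each child spark work would carry its
    // own scan, reading the table from disk twice:
    //   val scanForChildA = sc.textFile("/warehouse/t")
    //   val scanForChildB = sc.textFile("/warehouse/t")

    // After SharedWorkOptimizer: one merged scan, cached, so the second
    // child reuses the result rather than repeating the scan.
    val sharedScan = sc.textFile("/warehouse/t").cache()
    sharedScan.filter(_.contains("a")).saveAsTextFile("/warehouse/outA")
    sharedScan.filter(_.contains("b")).saveAsTextFile("/warehouse/outB")

    sc.stop()
  }
}
{code}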



