[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292286#comment-16292286 ]
liyunzhang commented on HIVE-17486:
-----------------------------------

[~xuefuz]: you mentioned earlier that the reason for disabling caching for MapInput is the [IOContext initialization problem|https://issues.apache.org/jira/browse/HIVE-8920?focusedCommentId=14260846&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14260846]. HIVE-9041 gives an example of this problem:
{code}
I just found another bug regarding IOContext, when caching is turned on. Taking the sample query above as an example, right now I have this result plan:

MW 1 (table0)   MW 2 (table1)   MW 3 (table0)   MW 4 (table1)
       \            /                  \            /
        \          /                    \          /
           RW 1                            RW 2

Suppose the MapWorks are executed from left to right, and suppose we are running with a single thread. Then the following will happen:
1. Executing MW 1: since this is the first time we access table0, initialize IOContext and make the input path point to table0.
2. Executing MW 2: since this is the first time we access table1, initialize IOContext and make the input path point to table1.
3. Executing MW 3: since this is the second time we access table0, do not initialize IOContext; instead use the copy saved in step 2, which points to table1.
Step 3 will then fail. How can MW 3 know that it needs the IOContext saved by MW 1, not the one from MW 2?
{code}
If the problem exists in the cached MapInput RDD because IOContext is a static variable that is stored in the cache and then updated by different MapWorks, why is caching disabled only for the [MapInput RDD|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202]? It seems it should be disabled for all MapTrans. Please explain more when you have time.
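The stale-context failure in step 3 can be sketched as follows. This is a minimal, self-contained Java sketch using hypothetical names (`IOContextClobber`, `mapWork`), not Hive's actual IOContext API; it assumes only that the input path lives in a static field that is initialized on first access to a table and reused afterwards:
{code}
import java.util.ArrayList;
import java.util.List;

public class IOContextClobber {
    // Stands in for the static/shared IOContext input path.
    static String inputPath = null;
    // Tables whose context has already been "initialized".
    static final List<String> initialized = new ArrayList<>();

    // First access to a table initializes the context; later accesses reuse
    // whatever the static field currently holds -- possibly the wrong table.
    static String mapWork(String table) {
        if (!initialized.contains(table)) {
            initialized.add(table);
            inputPath = table;   // MW 1 and MW 2: init on first access
        }
        return inputPath;
    }

    public static void main(String[] args) {
        System.out.println(mapWork("table0")); // table0
        System.out.println(mapWork("table1")); // table1
        System.out.println(mapWork("table0")); // table1 -- stale context, MW 3 fails
    }
}
{code}
The third call returns table1's path even though MW 3 scans table0, which is exactly the failure described above: the static field holds the last-written value, with no way to recover the copy MW 1 saved.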
> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, explain.28.share.false, explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
> In HIVE-16602, shared scans were implemented with Tez. Given a query plan, the goal is to identify scans on input tables that can be merged so the data is read only once. The optimization is carried out at the physical level. In Hive on Spark, the result of a spark work is cached if it is used by more than 1 child spark work. After SharedWorkOptimizer is enabled in the HoS physical plan, identical table scans are merged into 1 table scan, and its result is used by more than 1 child spark work. Thus, because of the cache mechanism, we need not do the same computation twice.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)