[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237271#comment-16237271 ]
liyunzhang commented on HIVE-17486:
-----------------------------------

[~lirui]:
{quote}
I also think that's possible in theory. But I guess it will require lots of work. E.g. we may need to modify MapOperator to accommodate the new M->M->R scheme
{quote}
I am now working on changing the {{M->R}} scheme to {{M->M->R}}. I am not very clear about the modifications needed in MapOperator; if you know, please explain in more detail. I think we first need to change {{GenSparkWork}} to split the physical operator tree whenever a TS has more than 1 child. For example, given the physical plan
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
     -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}
TS\[0\] has two children (FIL\[52\], FIL\[53\]). We first split at TS\[0\] and put it into Map1, then split the following operator trees whenever an RS is encountered. So the final operator tree will be
{code}
Map1: TS[0]
Map2: FIL[52]-SEL[2]-GBY[3]-RS[4]
Map3: FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1: GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2: GBY[12]-RS[43]
{code}
This is very initial thinking. If you have suggestions, please tell me, thanks! (A rough sketch of the map-side split is at the end of this message.)

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>            Priority: Major
>         Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez. Given a query plan, the goal is to identify scans on input tables that can be merged so the data is read only once; the optimization is carried out at the physical level. In Hive on Spark, the result of a spark work is cached if that work is used by more than 1 child spark work. Once SharedWorkOptimizer is enabled in the HoS physical plan, identical table scans are merged into 1 table scan whose result is used by more than 1 child spark work, so the same computation does not need to be repeated thanks to the cache mechanism.
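To make the intended split concrete, here is a minimal, hypothetical Java sketch. {{Op}}, {{MapSplitSketch}}, and all method names are made up for illustration and are not real Hive classes or {{GenSparkWork}} code; it only models the map-side rule described above: a TS with more than one child becomes its own map work, and each child branch becomes a new map work ending at its first RS.
{code:java}
// Hypothetical sketch only -- Op and MapSplitSketch are made-up stand-ins,
// not real Hive classes. It models just the map-side rule: a TS with more
// than one child becomes its own map work, and each child branch becomes a
// new map work ending at (and including) its first RS.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class Op {
    final String name;                            // e.g. "TS[0]", "RS[4]"
    final List<Op> children = new ArrayList<>();
    Op(String name) { this.name = name; }
    Op add(Op child) { children.add(child); return child; }
    boolean isReduceSink() { return name.startsWith("RS"); }
}

public class MapSplitSketch {
    // Returns one list of operator names per map work.
    static List<List<String>> splitAtMultiChildTs(Op ts) {
        List<List<String>> works = new ArrayList<>();
        if (ts.children.size() <= 1) {            // nothing to split
            works.add(collectBranch(ts));
            return works;
        }
        works.add(Collections.singletonList(ts.name));    // Map1: the shared TS alone
        for (Op child : ts.children) {
            works.add(collectBranch(child));               // Map2, Map3, ...
        }
        return works;
    }

    // Walk a single-child chain until the first RS (inclusive).
    private static List<String> collectBranch(Op start) {
        List<String> ops = new ArrayList<>();
        Op cur = start;
        while (cur != null) {
            ops.add(cur.name);
            if (cur.isReduceSink() || cur.children.isEmpty()) {
                break;
            }
            cur = cur.children.get(0);
        }
        return ops;
    }

    public static void main(String[] args) {
        Op ts = new Op("TS[0]");
        ts.add(new Op("FIL[52]")).add(new Op("SEL[2]")).add(new Op("GBY[3]")).add(new Op("RS[4]"));
        ts.add(new Op("FIL[53]")).add(new Op("SEL[9]")).add(new Op("GBY[10]")).add(new Op("RS[11]"));
        // Prints: [TS[0]], [FIL[52], SEL[2], GBY[3], RS[4]], [FIL[53], SEL[9], GBY[10], RS[11]]
        splitAtMultiChildTs(ts).forEach(System.out::println);
    }
}
{code}
The reduce-side grouping (Reducer1/Reducer2 above) would come from the existing RS-based split; the sketch does not attempt to reproduce it.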