[ https://issues.apache.org/jira/browse/HIVE-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234767#comment-14234767 ]
Szehon Ho commented on HIVE-9007:
---------------------------------

I'll leave this JIRA for now. One observation: the problem shows up in the ppd_join4.q test if you add "set hive.auto.convert.join=true" to the test. The resulting plan has too many HashTableSinks; both Map 1 and Map 2 end in a Spark HashTable Sink Operator.

{noformat}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: test_tbl
                  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
                  Filter Operator
                    predicate: ((id is not null and (name = 'c')) and (id = 'a')) (type: boolean)
                    Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
                      Spark HashTable Sink Operator
                        condition expressions:
                          0
                          1
                        keys:
                          0 'a' (type: string)
                          1 'a' (type: string)
            Local Work:
              Map Reduce Local Work
        Map 2
            Map Operator Tree:
                TableScan
                  alias: t3
                  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
                  Filter Operator
                    predicate: (id = 'a') (type: boolean)
                    Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
                    Spark HashTable Sink Operator
                      condition expressions:
                        0
                        1
                      keys:
                        0 'a' (type: string)
                        1 'a' (type: string)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{noformat}

It could be related to this issue. I'll come back to this JIRA at a later point, or others who are free can take it.

> Hive may generate wrong plan for map join queries due to IdentityProjectRemover [Spark Branch]
> ----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-9007
>                 URL: https://issues.apache.org/jira/browse/HIVE-9007
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>   Affects Versions: spark-branch
>            Reporter: Chao
>            Assignee: Szehon Ho
>
> HIVE-8435 introduces a new logical optimizer called IdentityProjectRemover,
> which may cause map join in the spark branch to generate a wrong plan.
> Currently, the map join conversion in the spark branch first goes through a
> method {{convertJoinMapJoin}}, which replaces a join op with a mapjoin op,
> removes the RS associated with the big table, and keeps the RSs for all small tables.
> Afterwards, {{SparkReduceSinkMapJoinProc}} replaces all parent RSs of
> the mapjoin op with HTS (note that it doesn't check whether an RS belongs to
> a small table or the big table).
> The issue arises when IdentityProjectRemover comes into play, since it may
> leave an operator tree with two consecutive RSs. Imagine
> the following example:
> {noformat}
>       Join                    MapJoin
>      /    \                  /       \
>    RS      RS     --->     RS         RS
>   /          \            /             \
> TS            RS        TS               TS (big table)
>                \    (small table)
>                 TS
> {noformat}
> In this case, all parents of the mapjoin op will be RSs, even on the branch for
> the big table! In {{SparkReduceSinkMapJoinProc}}, they will all be replaced with HTS,
> which is obviously incorrect.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
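
To make the failure mode in the description concrete, here is a minimal sketch in Java. It is a toy model, not Hive code: the {{Op}} and {{Kind}} types, the method names, and the position-based guard are all hypothetical and only mirror the right-hand tree in the diagram, where the mapjoin op still has an RS parent on the big-table branch. The guard shown is one possible check for illustration and is not necessarily how the issue was or should be fixed.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: these are NOT Hive's real operator classes. */
final class MapJoinSketch {

    enum Kind { TS, RS, HTS, MAPJOIN }

    /** A minimal operator node: a kind plus an ordered list of parents. */
    static final class Op {
        Kind kind;
        final List<Op> parents = new ArrayList<>();

        Op(Kind kind, Op... parents) {
            this.kind = kind;
            for (Op p : parents) {
                this.parents.add(p);
            }
        }
    }

    /** Right-hand tree from the description: both parents of the mapjoin are RS. */
    static Op buildExample() {
        Op small = new Op(Kind.RS, new Op(Kind.TS)); // position 0: small table
        Op big = new Op(Kind.RS, new Op(Kind.TS));   // position 1: big table, leftover RS
        return new Op(Kind.MAPJOIN, small, big);
    }

    /** Unchecked behaviour: every RS parent becomes an HTS. Returns the number replaced. */
    static int replaceAllParents(Op mapJoin) {
        int replaced = 0;
        for (Op parent : mapJoin.parents) {
            if (parent.kind == Kind.RS) {
                parent.kind = Kind.HTS;
                replaced++;
            }
        }
        return replaced;
    }

    /** Hypothetical guard: leave the parent at the big-table position untouched. */
    static int replaceSmallTableParents(Op mapJoin, int bigTablePos) {
        int replaced = 0;
        for (int i = 0; i < mapJoin.parents.size(); i++) {
            Op parent = mapJoin.parents.get(i);
            if (i != bigTablePos && parent.kind == Kind.RS) {
                parent.kind = Kind.HTS;
                replaced++;
            }
        }
        return replaced;
    }

    public static void main(String[] args) {
        System.out.println("unchecked: " + replaceAllParents(buildExample()) + " HTS");
        System.out.println("guarded:   " + replaceSmallTableParents(buildExample(), 1) + " HTS");
    }
}
```

In this toy tree the unchecked pass turns both branches into HTS, which matches the shape of the ppd_join4.q plan where both Map 1 and Map 2 end in a HashTable sink, while the guarded pass converts only the small-table branch.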