[ https://issues.apache.org/jira/browse/HIVE-24033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176324#comment-17176324 ]
Sebastian Klemke commented on HIVE-24033: ----------------------------------------- Execution plan of the failing query is here: [^failing_query_plan.txt] joinreducededuplication optimizer logs for this query: {code:java} 2020-07-31T14:42:41,542 DEBUG [89354899-5041-441a-ab6f-41e4eb1d3625 main] correlation.ReduceSinkJoinDeDuplication: Set RS[21] to forward data 2020-07-31T14:42:41,542 DEBUG [89354899-5041-441a-ab6f-41e4eb1d3625 main] correlation.ReduceSinkJoinDeDuplication: Set RS[20] to FIXED parallelism: 120 2020-07-31T14:42:41,542 DEBUG [89354899-5041-441a-ab6f-41e4eb1d3625 main] correlation.ReduceSinkJoinDeDuplication: Set RS[21] to FIXED parallelism: 120 2020-07-31T14:42:41,542 DEBUG [89354899-5041-441a-ab6f-41e4eb1d3625 main] correlation.ReduceSinkJoinDeDuplication: Set RS[17] to FIXED parallelism: 120 {code} > full outer join returns wrong number of results if > hive.optimize.joinreducededuplication is enabled > --------------------------------------------------------------------------------------------------- > > Key: HIVE-24033 > URL: https://issues.apache.org/jira/browse/HIVE-24033 > Project: Hive > Issue Type: Bug > Reporter: Sebastian Klemke > Priority: Major > Attachments: failing_query_plan.txt > > > We encountered a hive query that returns incorrect results, when joining two > CTEs on a group by value. The input tables `id_table` and > `reference_table` are unfortunately too large to share and on smaller tables > we have not been able to reproduce. > {code} > WITH ids AS ( > SELECT > record.id AS id > FROM > `id_table` > LATERAL VIEW explode(records) r AS record > WHERE > record.id = '5ef0bad74d325f72f0360c19' > LIMIT 1 > ), > refs AS ( > SELECT > reference['id'] AS referenceId > FROM > `reference_table` > WHERE > partition_date = '2020-06-24' > AND type = '1b0e9eb5c492d1859815410253dd79b5' > AND reference['id'] = '5ef0bad74d325f72f0360c19' > GROUP BY > reference['id'] > ) > SELECT > l.id AS id > , r.referenceId AS referenceId > FROM > ids l > FULL OUTER JOIN > refs r > ON > l.id = r.referenceId > {code} > This returns 2 rows, because the join clause misses: > {code} > OK > 5ef0bad74d325f72f0360c19 NULL > NULL 5ef0bad74d325f72f0360c19 > {code} > Instead, a single row should be returned. The correct behavior can be > achieved by either > * calling lower() on the refs group by statement (doesn't change the string > contents) > * setting hive.optimize.joinreducededuplication=false -- This message was sent by Atlassian Jira (v8.3.4#803005)