[ https://issues.apache.org/jira/browse/HIVE-23903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
mahesh kumar behera reassigned HIVE-23903: ------------------------------------------ > Support "not-in" for bloom filter > --------------------------------- > > Key: HIVE-23903 > URL: https://issues.apache.org/jira/browse/HIVE-23903 > Project: Hive > Issue Type: Bug > Reporter: mahesh kumar behera > Assignee: mahesh kumar behera > Priority: Major > Labels: pull-request-available > > Currently hive does not support Anti join. The query for anti join is > converted to left outer join and null filter on right side join key is added > to get the desired result. This is causing > # Extra computation — The left outer join projects the redundant columns > from right side. Along with that, filtering is done to remove the redundant > rows. This is can be avoided in case of anti join as anti join will project > only the required columns and rows from the left side table. > # Extra shuffle — In case of anti join the duplicate records moved to join > node can be avoided from the child node. This can reduce significant amount > of data movement if the number of distinct rows( join keys) is significant. > # Extra Memory Usage - In case of map based anti join , hash set is > sufficient as just the key is required to check if the records matches the > join condition. In case of left join, we need the key and the non key columns > also and thus a hash table will be required. > For a query like > {code:java} > select wr_order_number FROM web_returns LEFT JOIN web_sales ON > wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code} > The number of distinct ws_order_number in web_sales table in a typical 10TB > TPCDS set up is just 10% of total records. So when we convert this query to > anti join, instead of 7 billion rows, only 600 million rows are moved to join > node. > In the current patch, just one conversion is done. The pattern of > project->filter->left-join is converted to project->anti-join. This will take > care of sub queries with “not exists” clause. The queries with “not exists” > are converted first to filter + left-join and then its converted to anti > join. The queries with “not in” are not handled in the current patch. > From execution side, both merge join and map join with vectorized execution > is supported for anti join. -- This message was sent by Atlassian Jira (v8.3.4#803005)