[ https://issues.apache.org/jira/browse/HIVE-23880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated HIVE-23880: -------------------------------- Attachment: lipwig-output3605036885489193068.svg > Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge > --------------------------------------------------------------------------- > > Key: HIVE-23880 > URL: https://issues.apache.org/jira/browse/HIVE-23880 > Project: Hive > Issue Type: Improvement > Reporter: László Bodor > Assignee: László Bodor > Priority: Major > Attachments: lipwig-output3605036885489193068.svg > > > Merging bloom filters in semijoin reduction can become the main bottleneck in > case of large number of source mapper tasks (~1000) and a large amount of > expected entries (50M) in bloom filters. > For example in TPCDS Q93: > {code} > select /*+ semi(store_returns, sr_item_sk, store_sales, 70000000)*/ > ss_customer_sk > ,sum(act_sales) sumsales > from (select ss_item_sk > ,ss_ticket_number > ,ss_customer_sk > ,case when sr_return_quantity is not null then > (ss_quantity-sr_return_quantity)*ss_sales_price > else > (ss_quantity*ss_sales_price) end act_sales > from store_sales left outer join store_returns on (sr_item_sk = > ss_item_sk > and > sr_ticket_number = ss_ticket_number) > ,reason > where sr_reason_sk = r_reason_sk > and r_reason_desc = 'reason 66') t > group by ss_customer_sk > order by sumsales, ss_customer_sk > limit 100; > {code} > On 10TB-30TB scale there is a chance that from 3-4 mins of query runtime 1-2 > mins are spent with merging bloom filters, as in: -- This message was sent by Atlassian Jira (v8.3.4#803005)