Dustin Cote created HIVE-13019:
----------------------------------

Summary: Optimizer COLLECT_LIST/COLLECT_SET
Key: HIVE-13019
URL: https://issues.apache.org/jira/browse/HIVE-13019
Project: Hive
Issue Type: Improvement
Reporter: Dustin Cote
Priority: Minor

Currently, when a query applies COLLECT_SET/COLLECT_LIST to data from a single table, the aggregation is performed after any JOIN operation present in the query. For example:

{code}
insert into table nested_customers_orders
select c.*, collect_list(named_struct("oid", o.oid, "order_date", o.date...))
from customers c
inner join orders o on (c.cid = o.oid)
group by o.oid, o.date,...
{code}

If we can tell the optimizer to perform the COLLECT_LIST first (where possible), we could see some performance gains for this pattern of query, since the JOIN would then process rows that have already been aggregated.
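
To illustrate what "COLLECT_LIST first" could mean, here is a minimal sketch of the rewrite done by hand: the orders are aggregated in a subquery and only then joined to customers. This is an illustration, not the optimizer's actual plan. It assumes, hypothetically, that orders carries a customer key column named cid and an order_date column; neither name comes from the query above, and the struct shows only two fields.

{code}
-- Hand-written sketch of aggregating before the join.
-- Assumed (hypothetical) columns: orders.cid, orders.oid, orders.order_date.
select c.*, o.order_list
from customers c
inner join (
  -- One row per customer, holding that customer's orders as a list of structs.
  select cid,
         collect_list(named_struct("oid", oid, "order_date", order_date)) as order_list
  from orders
  group by cid
) o on (c.cid = o.cid)
{code}

Written this way, the JOIN sees one row per customer from the orders side instead of one row per order, which is the effect the proposed optimization would aim to produce automatically.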