Using Cluster by to improve Group by Performance

KayVajj Mon, 28 Oct 2013 13:19:00 -0700

Hi,

I have a question if I could use the cluster by clause in a sub query to
improve the performance of a group by query in hive


Lets I have a Table A with columns (all strings) col1..col5 and the table
is not "Clustered"

now I 'm trying to run the below query

select
> col1,
> col2,
> col3,
> col4,
> concat_ws(',', collect_set(col5))
> from A
> group by
> col1,
> col2,
> col3,
> col4



Would the below query optimize the above query and if not what is the best
practice to optimize this query. Assuming only col1 & col2 are the uniquely
identifying columns




select
> ct.col1,
> ct.col2,
> ct.col3,
> ct.col4,
> concat_ws(',', collect_set(ct.col5))
> from
> (
> select
> col1,
> col2,
> col3,
> col4,
> col5
> from A
> cluster by col1, col2
> ) ct
> group by
> ct.col1,
> ct.col2,
> ct.col3,
> ct.col4.


Thanks for your responses.

Using Cluster by to improve Group by Performance

Reply via email to