Hi, They are rigorously equivalent.
You can see this with the following queries: CREATE TABLE t1 (a INT, b INT, c INT) ; EXPLAIN SELECT DISTINCT a,b,c FROM t1 ; EXPLAIN SELECT a,b,c FROM t1 GROUP BY a,b,c ; Both queries will return the exact same query plan: Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 2 vectorized File Output Operator [FS_8] compressed:false Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"} Group By Operator [OP_7] | keys:KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int) | outputColumnNames:["_col0","_col1","_col2"] | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE |<-Map 1 [SIMPLE_EDGE] Reduce Output Operator [RS_3] key expressions:_col0 (type: int), _col1 (type: int), _col2 (type: int) Map-reduce partition columns:_col0 (type: int), _col1 (type: int), _col2 (type: int) sort order:+++ Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Group By Operator [GBY_2] keys:a (type: int), b (type: int), c (type: int) outputColumnNames:["_col0","_col1","_col2"] Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator [SEL_1] outputColumnNames:["a","b","c"] Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE TableScan [TS_0] alias:t1 Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE However, these two queries are NOT equivalent: SELECT COUNT(DISTINCT a,b,c) FROM t1 ; SELECT COUNT(1) FROM ( SELECT a,b,c FROM t1 GROUP BY a,b,c ) T ; In general, the first one is faster except that it can fail if Hive optimize it poorly, while the second one is slower but more reliable. Also, most importantly, they don't give the same results as COUNT(DISTINCT a, b, c) will ignore any row where a, b or c is null. Their respective query plans are : Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 2 File Output Operator [FS_6] compressed:false Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"} Group By Operator [GBY_4] | aggregations:["count(DISTINCT KEY._col0:0._col0, KEY._col0:0._col1, KEY._col0:0._col2)"] | outputColumnNames:["_col0"] | Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE |<-Map 1 [SIMPLE_EDGE] Reduce Output Operator [RS_3] key expressions:_col0 (type: int), _col1 (type: int), _col2 (type: int) sort order:+++ Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Group By Operator [GBY_2] aggregations:["count(DISTINCT a, b, c)"] keys:a (type: int), b (type: int), c (type: int) outputColumnNames:["_col0","_col1","_col2","_col3"] Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator [SEL_1] outputColumnNames:["a","b","c"] Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE TableScan [TS_0] alias:t1 Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE and Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 3 vectorized File Output Operator [FS_13] compressed:false Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"} Group By Operator [OP_12] | aggregations:["count(VALUE._col0)"] | outputColumnNames:["_col0"] | Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |<-Reducer 2 [SIMPLE_EDGE] Reduce Output Operator [RS_8] sort order: Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions:_col0 (type: bigint) Group By Operator [GBY_7] aggregations:["count(1)"] outputColumnNames:["_col0"] Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Select Operator [SEL_5] Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Group By Operator [GBY_4] | keys:KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int) | outputColumnNames:["_col0","_col1","_col2"] | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE |<-Map 1 [SIMPLE_EDGE] Reduce Output Operator [RS_3] key expressions:_col0 (type: int), _col1 (type: int), _col2 (type: int) Map-reduce partition columns:_col0 (type: int), _col1 (type: int), _col2 (type: int) sort order:+++ Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Group By Operator [GBY_2] keys:a (type: int), b (type: int), c (type: int) outputColumnNames:["_col0","_col1","_col2"] Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator [SEL_1] outputColumnNames:["a","b","c"] Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE TableScan [TS_0] alias:t1 Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE On Mon, 2 Jul 2018 at 10:03, 孙志禹 <anci_...@yahoo.com> wrote: > Dear all, > Does the code1 below have a better efficiency than code2? > Thanks! > -------------------------------------------- > CODE1: > select > distinct a,b,c > from table1 > > CODE2: > select > a,b,c > from table1 > group by > a,b,c >