Hi, For the same query, for example,
select count(*), count(distinct mid) from db1.table1 where log_date between '2020-07-20' and '2020-07-26'; both Hive 2.3.2 and Hive 3.1.2 give different results for the same input. Note that db1.table1 is an ORC table and partitioned with the log_date column. Hive 2.3.2 +-----------+-----------+ | _c0 | _c1 | +-----------+-----------+ | 60040555 | 14723258 | +-----------+-----------+ Hive 3.1.2 +-----------+-----------+ | _c0 | _c1 | +-----------+-----------+ | 14723259 | 14723258 | +-----------+-----------+ Both plans are completely different, too. Hive 2.3.2 +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | Plan optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 2 <- Map 1 (SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 2 | | File Output Operator [FS_7] | | Group By Operator [GBY_5] (rows=1 width=24) | | Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT KEY._col0:0._col0)"] | | <-Map 1 [SIMPLE_EDGE] | | SHUFFLE [RS_4] | | Group By Operator [GBY_3] (rows=1429329075 width=100) | | Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT mid)"],keys:mid | | Select Operator [SEL_2] (rows=1429329075 width=100) | | Output:["mid"] | | TableScan [TS_0] (rows=1429329075 width=100) | | db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] | | | +----------------------------------------------------+ Hive 3.1.2 +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | Plan optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 2 <- Map 1 (SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 2 | | File Output Operator [FS_7] | | Group By Operator [GBY_14] (rows=1 width=16) | | Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] | | Group By Operator [GBY_11] (rows=343640771 width=4160) | | Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 | | <-Map 1 [SIMPLE_EDGE] | | SHUFFLE [RS_10] | | PartitionCols:_col0 | | Group By Operator [GBY_9] (rows=343640771 width=4160) | | Output:["_col0","_col1"],aggregations:["count()"],keys:mid | | Select Operator [SEL_2] (rows=343640771 width=4160) | | Output:["mid"] | | TableScan [TS_0] (rows=343640771 width=4160) | | japan_line_search_refine_log@daily_kpi_log,daily_kpi_log,Tbl:COMPLETE,Col:NONE,Output:["mid"] | | | +----------------------------------------------------+ Any guess on the reason why they are different? Best regards, Eugene Chung (Korean : 정의근)