[ https://issues.apache.org/jira/browse/HIVE-28223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhaojk updated HIVE-28223: -------------------------- Description: When the following four conditions are met during the SQL runtime, the result may be abnormal. The first condition is that there is no mapjoin association. The second condition is that there is a group by in the first paragraph of SQL logic. The third condition is that there is a row number over partition windowing function in the second paragraph of SQL. The fourth condition is that the association condition is of type int or bigint. When these four conditions are met simultaneously, the result will be missing data. If one section of SQL is changed to a subtable, the result is correct regardless of whether mapjoin is enabled or not, and converting the association condition to a string result is also correct. Detailed SQL and execution plan 大表和小表的SQL运行过程中,满足以下四个条件的时候导致结果异常,第一点没有走mapjoin关联,第二点第一段SQL逻辑中有group by,第三点第二段SQL中有row number over partition的开窗函数,第四点关联条件为int或者bigint类型,同时满足这四个条件的时候结果就会缺失数据。 如果将其中一段sql改为子表,不管是否开启mapjoin结果都是正确的,将关联条件转换为string结果也是正确的,详细SQL及执行计划 hive 3.1.0 TEZ0.9.1 关闭mapjoin,结果异常 Close mapjoin, result is abnormal set hive.auto.convert.join=false; select count(1) from ( select user_id from odb.test0000 --10524428 where etl_dt = '20240329' group by user_id ) a JOIN ( select user_id, points_balance, row_number() over( partition by user_id order by points_balance desc ) as rank_no from odb.test1111 --1800617 where etl_dt = '20240329' and points_balance <> 0 ) c on a.user_id = c.user_id where c.rank_no = 1; +--------+ | _c0 | +--------+ | 67594 | +--------+ +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | Plan optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 4 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) | | Reducer 5 <- Reducer 4 (CUSTOM_SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 5 vectorized | | File Output Operator [FS_43] | | Group By Operator [GBY_42] (rows=1 width=8) | | Output:["_col0"],aggregations:["count(VALUE._col0)"] | | <-Reducer 4 [CUSTOM_SIMPLE_EDGE] | | PARTITION_ONLY_SHUFFLE [RS_21] | | Group By Operator [GBY_20] (rows=1 width=8) | | Output:["_col0"],aggregations:["count()"] | | Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) | | Conds:GBY_5._col0=SEL_12._col0(Inner) | | <-Group By Operator [GBY_5] (rows=1613465 width=2359) | | Output:["_col0"],keys:KEY._col0 | | <-Select Operator [SEL_12] (rows=900308 width=8) | | Output:["_col0"] | | Filter Operator [FIL_28] (rows=900308 width=572) | | predicate:(row_number_window_0 = 1) | | PTF Operator [PTF_11] (rows=1800617 width=572) | | Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS LAST","partition by:":"_col2"}] | | Select Operator [SEL_10] (rows=1800617 width=572) | | Output:["_col2","_col4"] | | <-Map 1 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_39] | | PartitionCols:_col0 | | Group By Operator [GBY_38] (rows=3226931 width=2359) | | Output:["_col0"],keys:user_id | | Filter Operator [FIL_37] (rows=3226931 width=2359) | | predicate:user_id is not null | | TableScan [TS_0] (rows=3396769 width=2359) | | odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] | | <-Map 3 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_41] | | PartitionCols:user_id | | Filter Operator [FIL_40] (rows=1800617 width=120) | | predicate ((points_balance <> 0) and user_id is not null) | | TableScan [TS_7] (rows=1800617 width=120) | | odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] | | | +----------------------------------------------------+ 开启mapjoin,结果正确 Enable mapjoin, the result is correct set hive.auto.convert.join=true; select count(1) from ( select user_id from odb.test0000 --10524428 where etl_dt = '20240329' group by user_id ) a JOIN ( select user_id, points_balance, row_number() over( partition by user_id order by points_balance desc ) as rank_no from odb.test1111 --1800617 where etl_dt = '20240329' and points_balance <> 0 ) c on a.user_id = c.user_id where c.rank_no = 1; +---------+ | _c0 | +---------+ | 135417 | +---------+ +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | Plan optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 5 (BROADCAST_EDGE) | | Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) | | Reducer 5 <- Map 4 (SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 3 vectorized | | File Output Operator [FS_52] | | Group By Operator [GBY_51] (rows=1 width=8) | | Output:["_col0"],aggregations:["count(VALUE._col0)"] | | <-Reducer 2 [CUSTOM_SIMPLE_EDGE] vectorized | | PARTITION_ONLY_SHUFFLE [RS_50] | | Group By Operator [GBY_49] (rows=1 width=8) | | Output:["_col0"],aggregations:["count()"] | | Map Join Operator [MAPJOIN_48] (rows=1774811 width=2359) | | Conds:GBY_47._col0=RS_46._col0(Inner) | | <-Reducer 5 [BROADCAST_EDGE] vectorized | | BROADCAST [RS_46] | | PartitionCols:_col0 | | Select Operator [SEL_45] (rows=900308 width=8) | | Output:["_col0"] | | Filter Operator [FIL_44] (rows=900308 width=572) | | predicate:(row_number_window_0 = 1) | | PTF Operator [PTF_43] (rows=1800617 width=572) | | Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS LAST","partition by:":"_col2"}] | | Select Operator [SEL_42] (rows=1800617 width=572) | | Output:["_col2","_col4"] | | <-Map 4 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_41] | | PartitionCols:user_id | | Filter Operator [FIL_40] (rows=1800617 width=120) | | predicate:((points_balance <> 0) and user_id is not null) | | TableScan [TS_7] (rows=1800617 width=120) | | odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] | | <-Group By Operator [GBY_47] (rows=1613465 width=2359) | | Output:["_col0"],keys:KEY._col0 | | <-Map 1 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_39] | | PartitionCols:_col0 | | Group By Operator [GBY_38] (rows=3226931 width=2359) | | Output:["_col0"],keys:user_id | | Filter Operator [FIL_37] (rows=3226931 width=2359) | | predicate:user_id is not null | | TableScan [TS_0] (rows=3396769 width=2359) | | odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] | | | +----------------------------------------------------+ --关联条件cast为string,结果正确 The association condition cast is string, and the result is correct set hive.auto.convert.join=false; select count(1) from ( select user_id from odb.test0000 --10524428 where etl_dt = '20240329' group by user_id ) a JOIN ( select user_id, points_balance, row_number() over( partition by user_id order by points_balance desc ) as rank_no from odb.test1111 --1800617 where etl_dt = '20240329' and points_balance <> 0 ) c on cast(a.user_id as string) = cast(c.user_id as string) where c.rank_no = 1; +---------+ | _c0 | +---------+ | 135417 | +---------+ +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | Plan optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 2 <- Map 1 (SIMPLE_EDGE) | | Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE) | | Reducer 4 <- Reducer 3 (CUSTOM_SIMPLE_EDGE) | | Reducer 6 <- Map 5 (SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 4 vectorized | | File Output Operator [FS_49] | | Group By Operator [GBY_48] (rows=1 width=8) | | Output:["_col0"],aggregations:["count(VALUE._col0)"] | | <-Reducer 3 [CUSTOM_SIMPLE_EDGE] | | PARTITION_ONLY_SHUFFLE [RS_21] | | Group By Operator [GBY_20] (rows=1 width=8) | | Output:["_col0"],aggregations:["count()"] | | Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) | | Conds:RS_40.CAST( _col0 AS STRING)=RS_47.CAST( _col0 AS STRING)(Inner) | | <-Reducer 2 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_40] | | PartitionCols:CAST( _col0 AS STRING) | | Group By Operator [GBY_39] (rows=1613465 width=2359) | | Output:["_col0"],keys:KEY._col0 | | <-Map 1 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_38] | | PartitionCols:_col0 | | Group By Operator [GBY_37] (rows=3226931 width=2359) | | Output:["_col0"],keys:user_id | | Filter Operator [FIL_36] (rows=3226931 width=2359) | | predicate:user_id is not null | | TableScan [TS_0] (rows=3396769 width=2359) | | odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] | | <-Reducer 6 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_47] | | PartitionCols:CAST( _col0 AS STRING) | | Select Operator [SEL_46] (rows=900308 width=8) | | Output:["_col0"] | | Filter Operator [FIL_45] (rows=900308 width=572) | | predicate:(row_number_window_0 = 1) | | PTF Operator [PTF_44] (rows=1800617 width=572) | | Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS LAST","partition by:":"_col2"}] | | Select Operator [SEL_43] (rows=1800617 width=572) | | Output:["_col2","_col4"] | | <-Map 5 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_42] | | PartitionCols:user_id | | Filter Operator [FIL_41] (rows=1800617 width=120) | | predicate:((points_balance <> 0) and user_id is not null) | | TableScan [TS_7] (rows=1800617 width=120) | | odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] | | | +----------------------------------------------------+ ====== 将关联表转化为子表,不用group by,结果也是正确的 Convert the associated table to a subtable without using group by, and the result is also correct set hive.auto.convert.join=false; select count(1) from ( select user_id from tmp_dev.test11 ) a JOIN ( select user_id, points_balance, row_number() over( partition by user_id order by points_balance desc ) as rank_no from odb.test1111 --1800617 where etl_dt = '20240329' and points_balance <> 0 ) c on a.user_id = c.user_id; where c.rank_no = 1 +---------+ | _c0 | +---------+ | 135417 | +---------+ +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | Plan optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE) | | Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 3 vectorized | | File Output Operator [FS_37] | | Group By Operator [GBY_36] (rows=1 width=8) | | Output:["_col0"],aggregations:["count(VALUE._col0)"] | | <-Reducer 2 [CUSTOM_SIMPLE_EDGE] | | PARTITION_ONLY_SHUFFLE [RS_11] | | Group By Operator [GBY_10] (rows=1 width=8) | | Output:["_col0"],aggregations:["count()"] | | Merge Join Operator [MERGEJOIN_29] (rows=170303 width=8) | | Conds:RS_32._col0=RS_35._col0(Inner) | | <-Map 1 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_32] | | PartitionCols:_col0 | | Select Operator [SEL_31] (rows=170303 width=8) | | Output:["_col0"] | | Filter Operator [FIL_30] (rows=170303 width=8) | | predicate:user_id is not null | | TableScan [TS_0] (rows=170303 width=8) | | tmp_dev@test11,test11,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id"] | | <-Map 4 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_35] | | PartitionCols:_col0 | | Select Operator [SEL_34] (rows=1800617 width=8) | | Output:["_col0"] | | Filter Operator [FIL_33] (rows=1800617 width=120) | | predicate((points_balance <> 0) and user_id is not null) | | TableScan [TS_3] (rows=1800617 width=120) | | odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] | | | +----------------------------------------------------+ was: When the following four conditions are met during the SQL runtime, the result may be abnormal. The first condition is that there is no mapjoin association. The second condition is that there is a group by in the first paragraph of SQL logic. The third condition is that there is a row number over partition windowing function in the second paragraph of SQL. The fourth condition is that the association condition is of type int or bigint. When these four conditions are met simultaneously, the result will be missing data. If one section of SQL is changed to a subtable, the result is correct regardless of whether mapjoin is enabled or not, and converting the association condition to a string result is also correct. Detailed SQL and execution plan 大表和小表的SQL运行过程中,满足以下四个条件的时候导致结果异常,第一点没有走mapjoin关联,第二点第一段SQL逻辑中有group by,第三点第二段SQL中有row number over partition的开窗函数,第四点关联条件为int或者bigint类型,同时满足这四个条件的时候结果就会缺失数据。 如果将其中一段sql改为子表,不管是否开启mapjoin结果都是正确的,将关联条件转换为string结果也是正确的,详细SQL及执行计划 hive 3.1.0 TEZ0.9.1 关闭mapjoin,结果异常 Close mapjoin, result is abnormal set hive.auto.convert.join=false; select count(*) from ( select user_id from odb.test0000 --10524428 where etl_dt = '20240329' group by user_id ) a JOIN ( select user_id, points_balance, row_number() over( partition by user_id order by points_balance desc ) as rank_no from odb.test1111 --1800617 where etl_dt = '20240329' and points_balance <> 0 ) c on a.user_id = c.user_id where c.rank_no = 1; +--------+ | _c0 | +--------+ | 67594 | +--------+ +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | Plan optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 4 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) | | Reducer 5 <- Reducer 4 (CUSTOM_SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 5 vectorized | | File Output Operator [FS_43] | | Group By Operator [GBY_42] (rows=1 width=8) | | Output:["_col0"],aggregations:["count(VALUE._col0)"] | | <-Reducer 4 [CUSTOM_SIMPLE_EDGE] | | PARTITION_ONLY_SHUFFLE [RS_21] | | Group By Operator [GBY_20] (rows=1 width=8) | | Output:["_col0"],aggregations:["count()"] | | Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) | | Conds:GBY_5._col0=SEL_12._col0(Inner) | | <-Group By Operator [GBY_5] (rows=1613465 width=2359) | | Output:["_col0"],keys:KEY._col0 | | <-Select Operator [SEL_12] (rows=900308 width=8) | | Output:["_col0"] | | Filter Operator [FIL_28] (rows=900308 width=572) | | predicate:(row_number_window_0 = 1) | | PTF Operator [PTF_11] (rows=1800617 width=572) | | Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS LAST","partition by:":"_col2"}] | | Select Operator [SEL_10] (rows=1800617 width=572) | | Output:["_col2","_col4"] | | <-Map 1 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_39] | | PartitionCols:_col0 | | Group By Operator [GBY_38] (rows=3226931 width=2359) | | Output:["_col0"],keys:user_id | | Filter Operator [FIL_37] (rows=3226931 width=2359) | | predicate:user_id is not null | | TableScan [TS_0] (rows=3396769 width=2359) | | odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] | | <-Map 3 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_41] | | PartitionCols:user_id | | Filter Operator [FIL_40] (rows=1800617 width=120) | | predicate:((points_balance <> 0) and user_id is not null) | | TableScan [TS_7] (rows=1800617 width=120) | | odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] | | | +----------------------------------------------------+ 开启mapjoin,结果正确 Enable mapjoin, the result is correct set hive.auto.convert.join=true; select count(*) from ( select user_id from odb.test0000 --10524428 where etl_dt = '20240329' group by user_id ) a JOIN ( select user_id, points_balance, row_number() over( partition by user_id order by points_balance desc ) as rank_no from odb.test1111 --1800617 where etl_dt = '20240329' and points_balance <> 0 ) c on a.user_id = c.user_id where c.rank_no = 1; +---------+ | _c0 | +---------+ | 135417 | +---------+ +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | Plan optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 5 (BROADCAST_EDGE) | | Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) | | Reducer 5 <- Map 4 (SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 3 vectorized | | File Output Operator [FS_52] | | Group By Operator [GBY_51] (rows=1 width=8) | | Output:["_col0"],aggregations:["count(VALUE._col0)"] | | <-Reducer 2 [CUSTOM_SIMPLE_EDGE] vectorized | | PARTITION_ONLY_SHUFFLE [RS_50] | | Group By Operator [GBY_49] (rows=1 width=8) | | Output:["_col0"],aggregations:["count()"] | | Map Join Operator [MAPJOIN_48] (rows=1774811 width=2359) | | Conds:GBY_47._col0=RS_46._col0(Inner) | | <-Reducer 5 [BROADCAST_EDGE] vectorized | | BROADCAST [RS_46] | | PartitionCols:_col0 | | Select Operator [SEL_45] (rows=900308 width=8) | | Output:["_col0"] | | Filter Operator [FIL_44] (rows=900308 width=572) | | predicate:(row_number_window_0 = 1) | | PTF Operator [PTF_43] (rows=1800617 width=572) | | Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS LAST","partition by:":"_col2"}] | | Select Operator [SEL_42] (rows=1800617 width=572) | | Output:["_col2","_col4"] | | <-Map 4 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_41] | | PartitionCols:user_id | | Filter Operator [FIL_40] (rows=1800617 width=120) | | predicate:((points_balance <> 0) and user_id is not null) | | TableScan [TS_7] (rows=1800617 width=120) | | odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] | | <-Group By Operator [GBY_47] (rows=1613465 width=2359) | | Output:["_col0"],keys:KEY._col0 | | <-Map 1 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_39] | | PartitionCols:_col0 | | Group By Operator [GBY_38] (rows=3226931 width=2359) | | Output:["_col0"],keys:user_id | | Filter Operator [FIL_37] (rows=3226931 width=2359) | | predicate:user_id is not null | | TableScan [TS_0] (rows=3396769 width=2359) | | odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] | | | +----------------------------------------------------+ --关联条件cast为string,结果正确 The association condition cast is string, and the result is correct set hive.auto.convert.join=false; select count(*) from ( select user_id from odb.test0000 --10524428 where etl_dt = '20240329' group by user_id ) a JOIN ( select user_id, points_balance, row_number() over( partition by user_id order by points_balance desc ) as rank_no from odb.test1111 --1800617 where etl_dt = '20240329' and points_balance <> 0 ) c on cast(a.user_id as string) = cast(c.user_id as string) where c.rank_no = 1; +---------+ | _c0 | +---------+ | 135417 | +---------+ +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | Plan optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 2 <- Map 1 (SIMPLE_EDGE) | | Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE) | | Reducer 4 <- Reducer 3 (CUSTOM_SIMPLE_EDGE) | | Reducer 6 <- Map 5 (SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 4 vectorized | | File Output Operator [FS_49] | | Group By Operator [GBY_48] (rows=1 width=8) | | Output:["_col0"],aggregations:["count(VALUE._col0)"] | | <-Reducer 3 [CUSTOM_SIMPLE_EDGE] | | PARTITION_ONLY_SHUFFLE [RS_21] | | Group By Operator [GBY_20] (rows=1 width=8) | | Output:["_col0"],aggregations:["count()"] | | Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) | | Conds:RS_40.CAST( _col0 AS STRING)=RS_47.CAST( _col0 AS STRING)(Inner) | | <-Reducer 2 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_40] | | PartitionCols:CAST( _col0 AS STRING) | | Group By Operator [GBY_39] (rows=1613465 width=2359) | | Output:["_col0"],keys:KEY._col0 | | <-Map 1 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_38] | | PartitionCols:_col0 | | Group By Operator [GBY_37] (rows=3226931 width=2359) | | Output:["_col0"],keys:user_id | | Filter Operator [FIL_36] (rows=3226931 width=2359) | | predicate:user_id is not null | | TableScan [TS_0] (rows=3396769 width=2359) | | odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] | | <-Reducer 6 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_47] | | PartitionCols:CAST( _col0 AS STRING) | | Select Operator [SEL_46] (rows=900308 width=8) | | Output:["_col0"] | | Filter Operator [FIL_45] (rows=900308 width=572) | | predicate:(row_number_window_0 = 1) | | PTF Operator [PTF_44] (rows=1800617 width=572) | | Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS LAST","partition by:":"_col2"}] | | Select Operator [SEL_43] (rows=1800617 width=572) | | Output:["_col2","_col4"] | | <-Map 5 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_42] | | PartitionCols:user_id | | Filter Operator [FIL_41] (rows=1800617 width=120) | | predicate:((points_balance <> 0) and user_id is not null) | | TableScan [TS_7] (rows=1800617 width=120) | | odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] | | | +----------------------------------------------------+ ====== 将关联表转化为子表,不用group by,结果也是正确的 Convert the associated table to a subtable without using group by, and the result is also correct set hive.auto.convert.join=false; select count(*) from ( select user_id from tmp_dev.test11 ) a JOIN ( select user_id, points_balance, row_number() over( partition by user_id order by points_balance desc ) as rank_no from odb.test1111 --1800617 where etl_dt = '20240329' and points_balance <> 0 ) c on a.user_id = c.user_id; where c.rank_no = 1 +---------+ | _c0 | +---------+ | 135417 | +---------+ +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | Plan optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE) | | Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 3 vectorized | | File Output Operator [FS_37] | | Group By Operator [GBY_36] (rows=1 width=8) | | Output:["_col0"],aggregations:["count(VALUE._col0)"] | | <-Reducer 2 [CUSTOM_SIMPLE_EDGE] | | PARTITION_ONLY_SHUFFLE [RS_11] | | Group By Operator [GBY_10] (rows=1 width=8) | | Output:["_col0"],aggregations:["count()"] | | Merge Join Operator [MERGEJOIN_29] (rows=170303 width=8) | | Conds:RS_32._col0=RS_35._col0(Inner) | | <-Map 1 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_32] | | PartitionCols:_col0 | | Select Operator [SEL_31] (rows=170303 width=8) | | Output:["_col0"] | | Filter Operator [FIL_30] (rows=170303 width=8) | | predicate:user_id is not null | | TableScan [TS_0] (rows=170303 width=8) | | tmp_dev@test11,test11,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id"] | | <-Map 4 [SIMPLE_EDGE] vectorized | | SHUFFLE [RS_35] | | PartitionCols:_col0 | | Select Operator [SEL_34] (rows=1800617 width=8) | | Output:["_col0"] | | Filter Operator [FIL_33] (rows=1800617 width=120) | | predicate:((points_balance <> 0) and user_id is not null) | | TableScan [TS_3] (rows=1800617 width=120) | | odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] | | | +----------------------------------------------------+ > Hive on tez: Association of large and small tables satisfies four conditions > simultaneously, resulting in data loss: 1. Close mapjoin, 2. Association has > group by, 3. Association has windowing function, and 4. Association condition > is int/bigint > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: HIVE-28223 > URL: https://issues.apache.org/jira/browse/HIVE-28223 > Project: Hive > Issue Type: Bug > Affects Versions: 3.1.0 > Reporter: zhaojk > Priority: Major > > When the following four conditions are met during the SQL runtime, the result > may be abnormal. The first condition is that there is no mapjoin association. > The second condition is that there is a group by in the first paragraph of > SQL logic. The third condition is that there is a row number over partition > windowing function in the second paragraph of SQL. The fourth condition is > that the association condition is of type int or bigint. When these four > conditions are met simultaneously, the result will be missing data. > If one section of SQL is changed to a subtable, the result is correct > regardless of whether mapjoin is enabled or not, and converting the > association condition to a string result is also correct. Detailed SQL and > execution plan > 大表和小表的SQL运行过程中,满足以下四个条件的时候导致结果异常,第一点没有走mapjoin关联,第二点第一段SQL逻辑中有group > by,第三点第二段SQL中有row number over > partition的开窗函数,第四点关联条件为int或者bigint类型,同时满足这四个条件的时候结果就会缺失数据。 > 如果将其中一段sql改为子表,不管是否开启mapjoin结果都是正确的,将关联条件转换为string结果也是正确的,详细SQL及执行计划 > hive 3.1.0 TEZ0.9.1 > 关闭mapjoin,结果异常 > Close mapjoin, result is abnormal > set hive.auto.convert.join=false; > select > count(1) > from > ( > select > user_id > from > odb.test0000 --10524428 > where > etl_dt = '20240329' > group by > user_id > ) a > JOIN ( > select > user_id, > points_balance, > row_number() over( > partition by user_id > order by > points_balance desc > ) as rank_no > from > odb.test1111 --1800617 > where > etl_dt = '20240329' > and points_balance <> 0 > ) c on a.user_id = c.user_id > where c.rank_no = 1; > > +--------+ > | _c0 | > +--------+ > | 67594 | > +--------+ > +----------------------------------------------------+ > | Explain | > +----------------------------------------------------+ > | Plan optimized by CBO. | > | | > | Vertex dependency in root stage | > | Reducer 4 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) | > | Reducer 5 <- Reducer 4 (CUSTOM_SIMPLE_EDGE) | > | | > | Stage-0 | > | Fetch Operator | > | limit:-1 | > | Stage-1 | > | Reducer 5 vectorized | > | File Output Operator [FS_43] | > | Group By Operator [GBY_42] (rows=1 width=8) | > | Output:["_col0"],aggregations:["count(VALUE._col0)"] | > | <-Reducer 4 [CUSTOM_SIMPLE_EDGE] | > | PARTITION_ONLY_SHUFFLE [RS_21] | > | Group By Operator [GBY_20] (rows=1 width=8) | > | Output:["_col0"],aggregations:["count()"] | > | Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) | > | Conds:GBY_5._col0=SEL_12._col0(Inner) | > | <-Group By Operator [GBY_5] (rows=1613465 width=2359) | > | Output:["_col0"],keys:KEY._col0 | > | <-Select Operator [SEL_12] (rows=900308 width=8) | > | Output:["_col0"] | > | Filter Operator [FIL_28] (rows=900308 width=572) | > | predicate:(row_number_window_0 = 1) | > | PTF Operator [PTF_11] (rows=1800617 width=572) | > | Function > definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC > NULLS LAST","partition by:":"_col2"}] | > | Select Operator [SEL_10] (rows=1800617 width=572) | > | Output:["_col2","_col4"] | > | <-Map 1 [SIMPLE_EDGE] vectorized | > | SHUFFLE [RS_39] | > | PartitionCols:_col0 | > | Group By Operator [GBY_38] (rows=3226931 > width=2359) | > | Output:["_col0"],keys:user_id | > | Filter Operator [FIL_37] (rows=3226931 > width=2359) | > | predicate:user_id is not null | > | TableScan [TS_0] (rows=3396769 width=2359) | > | > odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] | > | <-Map 3 [SIMPLE_EDGE] vectorized | > | SHUFFLE [RS_41] | > | PartitionCols:user_id | > | Filter Operator [FIL_40] (rows=1800617 width=120) > | > | predicate ((points_balance <> 0) and user_id is > not null) | > | TableScan [TS_7] (rows=1800617 width=120) | > | > odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] > | > | | > +----------------------------------------------------+ > 开启mapjoin,结果正确 > Enable mapjoin, the result is correct > set hive.auto.convert.join=true; > select > count(1) > from > ( > select > user_id > from > odb.test0000 --10524428 > where > etl_dt = '20240329' > group by > user_id > ) a > JOIN ( > select > user_id, > points_balance, > row_number() over( > partition by user_id > order by > points_balance desc > ) as rank_no > from > odb.test1111 --1800617 > where > etl_dt = '20240329' > and points_balance <> 0 > ) c on a.user_id = c.user_id > where c.rank_no = 1; > > +---------+ > | _c0 | > +---------+ > | 135417 | > +---------+ > +----------------------------------------------------+ > | Explain | > +----------------------------------------------------+ > | Plan optimized by CBO. | > | | > | Vertex dependency in root stage | > | Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 5 (BROADCAST_EDGE) | > | Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) | > | Reducer 5 <- Map 4 (SIMPLE_EDGE) | > | | > | Stage-0 | > | Fetch Operator | > | limit:-1 | > | Stage-1 | > | Reducer 3 vectorized | > | File Output Operator [FS_52] | > | Group By Operator [GBY_51] (rows=1 width=8) | > | Output:["_col0"],aggregations:["count(VALUE._col0)"] | > | <-Reducer 2 [CUSTOM_SIMPLE_EDGE] vectorized | > | PARTITION_ONLY_SHUFFLE [RS_50] | > | Group By Operator [GBY_49] (rows=1 width=8) | > | Output:["_col0"],aggregations:["count()"] | > | Map Join Operator [MAPJOIN_48] (rows=1774811 width=2359) | > | Conds:GBY_47._col0=RS_46._col0(Inner) | > | <-Reducer 5 [BROADCAST_EDGE] vectorized | > | BROADCAST [RS_46] | > | PartitionCols:_col0 | > | Select Operator [SEL_45] (rows=900308 width=8) | > | Output:["_col0"] | > | Filter Operator [FIL_44] (rows=900308 width=572) | > | predicate:(row_number_window_0 = 1) | > | PTF Operator [PTF_43] (rows=1800617 width=572) | > | Function > definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC > NULLS LAST","partition by:":"_col2"}] | > | Select Operator [SEL_42] (rows=1800617 width=572) | > | Output:["_col2","_col4"] | > | <-Map 4 [SIMPLE_EDGE] vectorized | > | SHUFFLE [RS_41] | > | PartitionCols:user_id | > | Filter Operator [FIL_40] (rows=1800617 > width=120) | > | predicate:((points_balance <> 0) and user_id > is not null) | > | TableScan [TS_7] (rows=1800617 width=120) | > | > odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] > | > | <-Group By Operator [GBY_47] (rows=1613465 width=2359) | > | Output:["_col0"],keys:KEY._col0 | > | <-Map 1 [SIMPLE_EDGE] vectorized | > | SHUFFLE [RS_39] | > | PartitionCols:_col0 | > | Group By Operator [GBY_38] (rows=3226931 width=2359) | > | Output:["_col0"],keys:user_id | > | Filter Operator [FIL_37] (rows=3226931 width=2359) | > | predicate:user_id is not null | > | TableScan [TS_0] (rows=3396769 width=2359) | > | > odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] | > | | > +----------------------------------------------------+ > --关联条件cast为string,结果正确 > The association condition cast is string, and the result is correct > set hive.auto.convert.join=false; > select > count(1) > from > ( > select > user_id > from > odb.test0000 --10524428 > where > etl_dt = '20240329' > group by > user_id > ) a > JOIN ( > select > user_id, > points_balance, > row_number() over( > partition by user_id > order by > points_balance desc > ) as rank_no > from > odb.test1111 --1800617 > where > etl_dt = '20240329' > and points_balance <> 0 > ) c on cast(a.user_id as string) = cast(c.user_id as string) > where c.rank_no = 1; > +---------+ > | _c0 | > +---------+ > | 135417 | > +---------+ > +----------------------------------------------------+ > | Explain | > +----------------------------------------------------+ > | Plan optimized by CBO. | > | | > | Vertex dependency in root stage | > | Reducer 2 <- Map 1 (SIMPLE_EDGE) | > | Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE) | > | Reducer 4 <- Reducer 3 (CUSTOM_SIMPLE_EDGE) | > | Reducer 6 <- Map 5 (SIMPLE_EDGE) | > | | > | Stage-0 | > | Fetch Operator | > | limit:-1 | > | Stage-1 | > | Reducer 4 vectorized | > | File Output Operator [FS_49] | > | Group By Operator [GBY_48] (rows=1 width=8) | > | Output:["_col0"],aggregations:["count(VALUE._col0)"] | > | <-Reducer 3 [CUSTOM_SIMPLE_EDGE] | > | PARTITION_ONLY_SHUFFLE [RS_21] | > | Group By Operator [GBY_20] (rows=1 width=8) | > | Output:["_col0"],aggregations:["count()"] | > | Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) | > | Conds:RS_40.CAST( _col0 AS STRING)=RS_47.CAST( _col0 AS > STRING)(Inner) | > | <-Reducer 2 [SIMPLE_EDGE] vectorized | > | SHUFFLE [RS_40] | > | PartitionCols:CAST( _col0 AS STRING) | > | Group By Operator [GBY_39] (rows=1613465 width=2359) | > | Output:["_col0"],keys:KEY._col0 | > | <-Map 1 [SIMPLE_EDGE] vectorized | > | SHUFFLE [RS_38] | > | PartitionCols:_col0 | > | Group By Operator [GBY_37] (rows=3226931 width=2359) | > | Output:["_col0"],keys:user_id | > | Filter Operator [FIL_36] (rows=3226931 width=2359) | > | predicate:user_id is not null | > | TableScan [TS_0] (rows=3396769 width=2359) | > | > odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] | > | <-Reducer 6 [SIMPLE_EDGE] vectorized | > | SHUFFLE [RS_47] | > | PartitionCols:CAST( _col0 AS STRING) | > | Select Operator [SEL_46] (rows=900308 width=8) | > | Output:["_col0"] | > | Filter Operator [FIL_45] (rows=900308 width=572) | > | predicate:(row_number_window_0 = 1) | > | PTF Operator [PTF_44] (rows=1800617 width=572) | > | Function > definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC > NULLS LAST","partition by:":"_col2"}] | > | Select Operator [SEL_43] (rows=1800617 width=572) | > | Output:["_col2","_col4"] | > | <-Map 5 [SIMPLE_EDGE] vectorized | > | SHUFFLE [RS_42] | > | PartitionCols:user_id | > | Filter Operator [FIL_41] (rows=1800617 > width=120) | > | predicate:((points_balance <> 0) and user_id > is not null) | > | TableScan [TS_7] (rows=1800617 width=120) | > | > odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] > | > | | > +----------------------------------------------------+ > ====== > 将关联表转化为子表,不用group by,结果也是正确的 > Convert the associated table to a subtable without using group by, and the > result is also correct > set hive.auto.convert.join=false; > select > count(1) > from > ( > select > user_id > from > tmp_dev.test11 > ) a > JOIN ( > select > user_id, > points_balance, > row_number() over( > partition by user_id > order by > points_balance desc > ) as rank_no > from > odb.test1111 --1800617 > where > etl_dt = '20240329' > and points_balance <> 0 > ) c on a.user_id = c.user_id; > where c.rank_no = 1 > > +---------+ > | _c0 | > +---------+ > | 135417 | > +---------+ > +----------------------------------------------------+ > | Explain | > +----------------------------------------------------+ > | Plan optimized by CBO. | > | | > | Vertex dependency in root stage | > | Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE) | > | Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) | > | | > | Stage-0 | > | Fetch Operator | > | limit:-1 | > | Stage-1 | > | Reducer 3 vectorized | > | File Output Operator [FS_37] | > | Group By Operator [GBY_36] (rows=1 width=8) | > | Output:["_col0"],aggregations:["count(VALUE._col0)"] | > | <-Reducer 2 [CUSTOM_SIMPLE_EDGE] | > | PARTITION_ONLY_SHUFFLE [RS_11] | > | Group By Operator [GBY_10] (rows=1 width=8) | > | Output:["_col0"],aggregations:["count()"] | > | Merge Join Operator [MERGEJOIN_29] (rows=170303 width=8) | > | Conds:RS_32._col0=RS_35._col0(Inner) | > | <-Map 1 [SIMPLE_EDGE] vectorized | > | SHUFFLE [RS_32] | > | PartitionCols:_col0 | > | Select Operator [SEL_31] (rows=170303 width=8) | > | Output:["_col0"] | > | Filter Operator [FIL_30] (rows=170303 width=8) | > | predicate:user_id is not null | > | TableScan [TS_0] (rows=170303 width=8) | > | > tmp_dev@test11,test11,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id"] | > | <-Map 4 [SIMPLE_EDGE] vectorized | > | SHUFFLE [RS_35] | > | PartitionCols:_col0 | > | Select Operator [SEL_34] (rows=1800617 width=8) | > | Output:["_col0"] | > | Filter Operator [FIL_33] (rows=1800617 width=120) | > | predicate((points_balance <> 0) and user_id is not > null) | > | TableScan [TS_3] (rows=1800617 width=120) | > | > odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"] > | > | | > +----------------------------------------------------+ -- This message was sent by Atlassian Jira (v8.20.10#820010)