[jira] [Updated] (HIVE-28223) Hive on tez: Association of large and small tables satisfies four conditions simultaneously, resulting in data loss: 1. Close mapjoin, 2. Association has group by, 3. Association has windowing function, and 4. Association condition is int/bigint

zhaojk (Jira) Fri, 26 Apr 2024 05:52:04 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-28223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


zhaojk updated HIVE-28223:
--------------------------
    Description: 
When the following four conditions are met during the SQL runtime, the result 
may be abnormal. The first condition is that there is no mapjoin association. 
The second condition is that there is a group by in the first paragraph of SQL 
logic. The third condition is that there is a row number over partition 
windowing function in the second paragraph of SQL. The fourth condition is that 
the association condition is of type int or bigint. When these four conditions 
are met simultaneously, the result will be missing data.
If one section of SQL is changed to a subtable, the result is correct 
regardless of whether mapjoin is enabled or not, and converting the association 
condition to a string result is also correct. Detailed SQL and execution plan

大表和小表的SQL运行过程中，满足以下四个条件的时候导致结果异常，第一点没有走mapjoin关联，第二点第一段SQL逻辑中有group 
by，第三点第二段SQL中有row number over 
partition的开窗函数，第四点关联条件为int或者bigint类型，同时满足这四个条件的时候结果就会缺失数据。
如果将其中一段sql改为子表，不管是否开启mapjoin结果都是正确的，将关联条件转换为string结果也是正确的，详细SQL及执行计划


hive 3.1.0  TEZ0.9.1

关闭mapjoin，结果异常
Close mapjoin, result is abnormal

set hive.auto.convert.join=false;
select
count(1)
from
  (
    select
      user_id
    from
      odb.test0000   --10524428
    where
      etl_dt = '20240329'
    group by
      user_id
  ) a
  JOIN (
    select
      user_id,
      points_balance,
      row_number() over(
        partition by user_id
        order by
          points_balance desc
      ) as rank_no
    from
      odb.test1111  --1800617
    where
      etl_dt = '20240329'
      and points_balance <> 0
  ) c on a.user_id = c.user_id
  where  c.rank_no = 1;
  
+--------+
|  _c0   |
+--------+
| 67594  |
+--------+

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 4 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) |
| Reducer 5 <- Reducer 4 (CUSTOM_SIMPLE_EDGE)        |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 5 vectorized                         |
|       File Output Operator [FS_43]                 |
|         Group By Operator [GBY_42] (rows=1 width=8) |
|           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
|         <-Reducer 4 [CUSTOM_SIMPLE_EDGE]           |
|           PARTITION_ONLY_SHUFFLE [RS_21]           |
|             Group By Operator [GBY_20] (rows=1 width=8) |
|               Output:["_col0"],aggregations:["count()"] |
|               Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) |
|                 Conds:GBY_5._col0=SEL_12._col0(Inner) |
|               <-Group By Operator [GBY_5] (rows=1613465 width=2359) |
|                   Output:["_col0"],keys:KEY._col0  |
|               <-Select Operator [SEL_12] (rows=900308 width=8) |
|                   Output:["_col0"]                 |
|                   Filter Operator [FIL_28] (rows=900308 width=572) |
|                     predicate:(row_number_window_0 = 1) |
|                     PTF Operator [PTF_11] (rows=1800617 width=572) |
|                       Function 
definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS 
LAST","partition by:":"_col2"}] |
|                       Select Operator [SEL_10] (rows=1800617 width=572) |
|                         Output:["_col2","_col4"]   |
|                       <-Map 1 [SIMPLE_EDGE] vectorized |
|                         SHUFFLE [RS_39]            |
|                           PartitionCols:_col0      |
|                           Group By Operator [GBY_38] (rows=3226931 
width=2359) |
|                             Output:["_col0"],keys:user_id |
|                             Filter Operator [FIL_37] (rows=3226931 
width=2359) |
|                               predicate:user_id is not null |
|                               TableScan [TS_0] (rows=3396769 width=2359) |
|                                 
odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] |
|                       <-Map 3 [SIMPLE_EDGE] vectorized |
|                         SHUFFLE [RS_41]            |
|                           PartitionCols:user_id    |
|                           Filter Operator [FIL_40] (rows=1800617 width=120) |
|                             predicate ((points_balance <> 0) and user_id is 
not null) |
|                             TableScan [TS_7] (rows=1800617 width=120) |
|                               
odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
 |
|                                                    |
+----------------------------------------------------+


开启mapjoin，结果正确
Enable mapjoin, the result is correct

set hive.auto.convert.join=true;
select
count(1)
from
  (
    select
      user_id
    from
      odb.test0000   --10524428
    where
      etl_dt = '20240329'
    group by
      user_id
  ) a
  JOIN (
    select
      user_id,
      points_balance,
      row_number() over(
        partition by user_id
        order by
          points_balance desc
      ) as rank_no
    from
      odb.test1111  --1800617
    where
      etl_dt = '20240329'
      and points_balance <> 0
  ) c on a.user_id = c.user_id
  where  c.rank_no = 1;
  
+---------+
|   _c0   |
+---------+
| 135417  |
+---------+
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 5 (BROADCAST_EDGE) |
| Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)        |
| Reducer 5 <- Map 4 (SIMPLE_EDGE)                   |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 3 vectorized                         |
|       File Output Operator [FS_52]                 |
|         Group By Operator [GBY_51] (rows=1 width=8) |
|           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
|         <-Reducer 2 [CUSTOM_SIMPLE_EDGE] vectorized |
|           PARTITION_ONLY_SHUFFLE [RS_50]           |
|             Group By Operator [GBY_49] (rows=1 width=8) |
|               Output:["_col0"],aggregations:["count()"] |
|               Map Join Operator [MAPJOIN_48] (rows=1774811 width=2359) |
|                 Conds:GBY_47._col0=RS_46._col0(Inner) |
|               <-Reducer 5 [BROADCAST_EDGE] vectorized |
|                 BROADCAST [RS_46]                  |
|                   PartitionCols:_col0              |
|                   Select Operator [SEL_45] (rows=900308 width=8) |
|                     Output:["_col0"]               |
|                     Filter Operator [FIL_44] (rows=900308 width=572) |
|                       predicate:(row_number_window_0 = 1) |
|                       PTF Operator [PTF_43] (rows=1800617 width=572) |
|                         Function 
definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS 
LAST","partition by:":"_col2"}] |
|                         Select Operator [SEL_42] (rows=1800617 width=572) |
|                           Output:["_col2","_col4"] |
|                         <-Map 4 [SIMPLE_EDGE] vectorized |
|                           SHUFFLE [RS_41]          |
|                             PartitionCols:user_id  |
|                             Filter Operator [FIL_40] (rows=1800617 width=120) 
|
|                               predicate:((points_balance <> 0) and user_id is 
not null) |
|                               TableScan [TS_7] (rows=1800617 width=120) |
|                                 
odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
 |
|               <-Group By Operator [GBY_47] (rows=1613465 width=2359) |
|                   Output:["_col0"],keys:KEY._col0  |
|                 <-Map 1 [SIMPLE_EDGE] vectorized   |
|                   SHUFFLE [RS_39]                  |
|                     PartitionCols:_col0            |
|                     Group By Operator [GBY_38] (rows=3226931 width=2359) |
|                       Output:["_col0"],keys:user_id |
|                       Filter Operator [FIL_37] (rows=3226931 width=2359) |
|                         predicate:user_id is not null |
|                         TableScan [TS_0] (rows=3396769 width=2359) |
|                           
odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] |
|                                                    |
+----------------------------------------------------+


--关联条件cast为string，结果正确

The association condition cast is string, and the result is correct

set hive.auto.convert.join=false;
select
count(1)
from
  (
    select
      user_id
    from
      odb.test0000   --10524428
    where
      etl_dt = '20240329'
    group by
      user_id
  ) a
  JOIN (
    select
      user_id,
      points_balance,
      row_number() over(
        partition by user_id
        order by
          points_balance desc
      ) as rank_no
    from
      odb.test1111  --1800617
    where
      etl_dt = '20240329'
      and points_balance <> 0
  ) c on cast(a.user_id as string) = cast(c.user_id as string)
  where  c.rank_no = 1;
+---------+
|   _c0   |
+---------+
| 135417  |
+---------+

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
| Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE) |
| Reducer 4 <- Reducer 3 (CUSTOM_SIMPLE_EDGE)        |
| Reducer 6 <- Map 5 (SIMPLE_EDGE)                   |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 4 vectorized                         |
|       File Output Operator [FS_49]                 |
|         Group By Operator [GBY_48] (rows=1 width=8) |
|           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
|         <-Reducer 3 [CUSTOM_SIMPLE_EDGE]           |
|           PARTITION_ONLY_SHUFFLE [RS_21]           |
|             Group By Operator [GBY_20] (rows=1 width=8) |
|               Output:["_col0"],aggregations:["count()"] |
|               Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) |
|                 Conds:RS_40.CAST( _col0 AS STRING)=RS_47.CAST( _col0 AS 
STRING)(Inner) |
|               <-Reducer 2 [SIMPLE_EDGE] vectorized |
|                 SHUFFLE [RS_40]                    |
|                   PartitionCols:CAST( _col0 AS STRING) |
|                   Group By Operator [GBY_39] (rows=1613465 width=2359) |
|                     Output:["_col0"],keys:KEY._col0 |
|                   <-Map 1 [SIMPLE_EDGE] vectorized |
|                     SHUFFLE [RS_38]                |
|                       PartitionCols:_col0          |
|                       Group By Operator [GBY_37] (rows=3226931 width=2359) |
|                         Output:["_col0"],keys:user_id |
|                         Filter Operator [FIL_36] (rows=3226931 width=2359) |
|                           predicate:user_id is not null |
|                           TableScan [TS_0] (rows=3396769 width=2359) |
|                             
odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] |
|               <-Reducer 6 [SIMPLE_EDGE] vectorized |
|                 SHUFFLE [RS_47]                    |
|                   PartitionCols:CAST( _col0 AS STRING) |
|                   Select Operator [SEL_46] (rows=900308 width=8) |
|                     Output:["_col0"]               |
|                     Filter Operator [FIL_45] (rows=900308 width=572) |
|                       predicate:(row_number_window_0 = 1) |
|                       PTF Operator [PTF_44] (rows=1800617 width=572) |
|                         Function 
definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS 
LAST","partition by:":"_col2"}] |
|                         Select Operator [SEL_43] (rows=1800617 width=572) |
|                           Output:["_col2","_col4"] |
|                         <-Map 5 [SIMPLE_EDGE] vectorized |
|                           SHUFFLE [RS_42]          |
|                             PartitionCols:user_id  |
|                             Filter Operator [FIL_41] (rows=1800617 width=120) 
|
|                               predicate:((points_balance <> 0) and user_id is 
not null) |
|                               TableScan [TS_7] (rows=1800617 width=120) |
|                                 
odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
 |
|                                                    |
+----------------------------------------------------+

======
将关联表转化为子表，不用group by，结果也是正确的
Convert the associated table to a subtable without using group by, and the 
result is also correct

set hive.auto.convert.join=false;
select
count(1)
from
  (
    select
      user_id
    from
      tmp_dev.test11
  ) a
  JOIN (
        select
      user_id,
      points_balance,
      row_number() over(
        partition by user_id
        order by
          points_balance desc
      ) as rank_no
    from
      odb.test1111  --1800617
    where
      etl_dt = '20240329'
      and points_balance <> 0
  ) c on a.user_id = c.user_id;
  where  c.rank_no = 1
  
+---------+
|   _c0   |
+---------+
| 135417  |
+---------+

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE) |
| Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)        |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 3 vectorized                         |
|       File Output Operator [FS_37]                 |
|         Group By Operator [GBY_36] (rows=1 width=8) |
|           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
|         <-Reducer 2 [CUSTOM_SIMPLE_EDGE]           |
|           PARTITION_ONLY_SHUFFLE [RS_11]           |
|             Group By Operator [GBY_10] (rows=1 width=8) |
|               Output:["_col0"],aggregations:["count()"] |
|               Merge Join Operator [MERGEJOIN_29] (rows=170303 width=8) |
|                 Conds:RS_32._col0=RS_35._col0(Inner) |
|               <-Map 1 [SIMPLE_EDGE] vectorized     |
|                 SHUFFLE [RS_32]                    |
|                   PartitionCols:_col0              |
|                   Select Operator [SEL_31] (rows=170303 width=8) |
|                     Output:["_col0"]               |
|                     Filter Operator [FIL_30] (rows=170303 width=8) |
|                       predicate:user_id is not null |
|                       TableScan [TS_0] (rows=170303 width=8) |
|                         
tmp_dev@test11,test11,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id"] |
|               <-Map 4 [SIMPLE_EDGE] vectorized     |
|                 SHUFFLE [RS_35]                    |
|                   PartitionCols:_col0              |
|                   Select Operator [SEL_34] (rows=1800617 width=8) |
|                     Output:["_col0"]               |
|                     Filter Operator [FIL_33] (rows=1800617 width=120) |
|                       predicate((points_balance <> 0) and user_id is not 
null) |
|                       TableScan [TS_3] (rows=1800617 width=120) |
|                         
odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
 |
|                                                    |
+----------------------------------------------------+

  was:
When the following four conditions are met during the SQL runtime, the result 
may be abnormal. The first condition is that there is no mapjoin association. 
The second condition is that there is a group by in the first paragraph of SQL 
logic. The third condition is that there is a row number over partition 
windowing function in the second paragraph of SQL. The fourth condition is that 
the association condition is of type int or bigint. When these four conditions 
are met simultaneously, the result will be missing data.
If one section of SQL is changed to a subtable, the result is correct 
regardless of whether mapjoin is enabled or not, and converting the association 
condition to a string result is also correct. Detailed SQL and execution plan

大表和小表的SQL运行过程中，满足以下四个条件的时候导致结果异常，第一点没有走mapjoin关联，第二点第一段SQL逻辑中有group 
by，第三点第二段SQL中有row number over 
partition的开窗函数，第四点关联条件为int或者bigint类型，同时满足这四个条件的时候结果就会缺失数据。
如果将其中一段sql改为子表，不管是否开启mapjoin结果都是正确的，将关联条件转换为string结果也是正确的，详细SQL及执行计划


hive 3.1.0  TEZ0.9.1

关闭mapjoin，结果异常
Close mapjoin, result is abnormal

set hive.auto.convert.join=false;
select
count(*)
from
  (
    select
      user_id
    from
      odb.test0000   --10524428
    where
      etl_dt = '20240329'
    group by
      user_id
  ) a
  JOIN (
    select
      user_id,
      points_balance,
      row_number() over(
        partition by user_id
        order by
          points_balance desc
      ) as rank_no
    from
      odb.test1111  --1800617
    where
      etl_dt = '20240329'
      and points_balance <> 0
  ) c on a.user_id = c.user_id
  where  c.rank_no = 1;
  
+--------+
|  _c0   |
+--------+
| 67594  |
+--------+

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 4 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) |
| Reducer 5 <- Reducer 4 (CUSTOM_SIMPLE_EDGE)        |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 5 vectorized                         |
|       File Output Operator [FS_43]                 |
|         Group By Operator [GBY_42] (rows=1 width=8) |
|           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
|         <-Reducer 4 [CUSTOM_SIMPLE_EDGE]           |
|           PARTITION_ONLY_SHUFFLE [RS_21]           |
|             Group By Operator [GBY_20] (rows=1 width=8) |
|               Output:["_col0"],aggregations:["count()"] |
|               Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) |
|                 Conds:GBY_5._col0=SEL_12._col0(Inner) |
|               <-Group By Operator [GBY_5] (rows=1613465 width=2359) |
|                   Output:["_col0"],keys:KEY._col0  |
|               <-Select Operator [SEL_12] (rows=900308 width=8) |
|                   Output:["_col0"]                 |
|                   Filter Operator [FIL_28] (rows=900308 width=572) |
|                     predicate:(row_number_window_0 = 1) |
|                     PTF Operator [PTF_11] (rows=1800617 width=572) |
|                       Function 
definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS 
LAST","partition by:":"_col2"}] |
|                       Select Operator [SEL_10] (rows=1800617 width=572) |
|                         Output:["_col2","_col4"]   |
|                       <-Map 1 [SIMPLE_EDGE] vectorized |
|                         SHUFFLE [RS_39]            |
|                           PartitionCols:_col0      |
|                           Group By Operator [GBY_38] (rows=3226931 
width=2359) |
|                             Output:["_col0"],keys:user_id |
|                             Filter Operator [FIL_37] (rows=3226931 
width=2359) |
|                               predicate:user_id is not null |
|                               TableScan [TS_0] (rows=3396769 width=2359) |
|                                 
odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] |
|                       <-Map 3 [SIMPLE_EDGE] vectorized |
|                         SHUFFLE [RS_41]            |
|                           PartitionCols:user_id    |
|                           Filter Operator [FIL_40] (rows=1800617 width=120) |
|                             predicate:((points_balance <> 0) and user_id is 
not null) |
|                             TableScan [TS_7] (rows=1800617 width=120) |
|                               
odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
 |
|                                                    |
+----------------------------------------------------+


开启mapjoin，结果正确
Enable mapjoin, the result is correct

set hive.auto.convert.join=true;
select
count(*)
from
  (
    select
      user_id
    from
      odb.test0000   --10524428
    where
      etl_dt = '20240329'
    group by
      user_id
  ) a
  JOIN (
    select
      user_id,
      points_balance,
      row_number() over(
        partition by user_id
        order by
          points_balance desc
      ) as rank_no
    from
      odb.test1111  --1800617
    where
      etl_dt = '20240329'
      and points_balance <> 0
  ) c on a.user_id = c.user_id
  where  c.rank_no = 1;
  
+---------+
|   _c0   |
+---------+
| 135417  |
+---------+
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 5 (BROADCAST_EDGE) |
| Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)        |
| Reducer 5 <- Map 4 (SIMPLE_EDGE)                   |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 3 vectorized                         |
|       File Output Operator [FS_52]                 |
|         Group By Operator [GBY_51] (rows=1 width=8) |
|           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
|         <-Reducer 2 [CUSTOM_SIMPLE_EDGE] vectorized |
|           PARTITION_ONLY_SHUFFLE [RS_50]           |
|             Group By Operator [GBY_49] (rows=1 width=8) |
|               Output:["_col0"],aggregations:["count()"] |
|               Map Join Operator [MAPJOIN_48] (rows=1774811 width=2359) |
|                 Conds:GBY_47._col0=RS_46._col0(Inner) |
|               <-Reducer 5 [BROADCAST_EDGE] vectorized |
|                 BROADCAST [RS_46]                  |
|                   PartitionCols:_col0              |
|                   Select Operator [SEL_45] (rows=900308 width=8) |
|                     Output:["_col0"]               |
|                     Filter Operator [FIL_44] (rows=900308 width=572) |
|                       predicate:(row_number_window_0 = 1) |
|                       PTF Operator [PTF_43] (rows=1800617 width=572) |
|                         Function 
definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS 
LAST","partition by:":"_col2"}] |
|                         Select Operator [SEL_42] (rows=1800617 width=572) |
|                           Output:["_col2","_col4"] |
|                         <-Map 4 [SIMPLE_EDGE] vectorized |
|                           SHUFFLE [RS_41]          |
|                             PartitionCols:user_id  |
|                             Filter Operator [FIL_40] (rows=1800617 width=120) 
|
|                               predicate:((points_balance <> 0) and user_id is 
not null) |
|                               TableScan [TS_7] (rows=1800617 width=120) |
|                                 
odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
 |
|               <-Group By Operator [GBY_47] (rows=1613465 width=2359) |
|                   Output:["_col0"],keys:KEY._col0  |
|                 <-Map 1 [SIMPLE_EDGE] vectorized   |
|                   SHUFFLE [RS_39]                  |
|                     PartitionCols:_col0            |
|                     Group By Operator [GBY_38] (rows=3226931 width=2359) |
|                       Output:["_col0"],keys:user_id |
|                       Filter Operator [FIL_37] (rows=3226931 width=2359) |
|                         predicate:user_id is not null |
|                         TableScan [TS_0] (rows=3396769 width=2359) |
|                           
odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] |
|                                                    |
+----------------------------------------------------+


--关联条件cast为string，结果正确

The association condition cast is string, and the result is correct

set hive.auto.convert.join=false;
select
count(*)
from
  (
    select
      user_id
    from
      odb.test0000   --10524428
    where
      etl_dt = '20240329'
    group by
      user_id
  ) a
  JOIN (
    select
      user_id,
      points_balance,
      row_number() over(
        partition by user_id
        order by
          points_balance desc
      ) as rank_no
    from
      odb.test1111  --1800617
    where
      etl_dt = '20240329'
      and points_balance <> 0
  ) c on cast(a.user_id as string) = cast(c.user_id as string)
  where  c.rank_no = 1;
+---------+
|   _c0   |
+---------+
| 135417  |
+---------+

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
| Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE) |
| Reducer 4 <- Reducer 3 (CUSTOM_SIMPLE_EDGE)        |
| Reducer 6 <- Map 5 (SIMPLE_EDGE)                   |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 4 vectorized                         |
|       File Output Operator [FS_49]                 |
|         Group By Operator [GBY_48] (rows=1 width=8) |
|           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
|         <-Reducer 3 [CUSTOM_SIMPLE_EDGE]           |
|           PARTITION_ONLY_SHUFFLE [RS_21]           |
|             Group By Operator [GBY_20] (rows=1 width=8) |
|               Output:["_col0"],aggregations:["count()"] |
|               Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) |
|                 Conds:RS_40.CAST( _col0 AS STRING)=RS_47.CAST( _col0 AS 
STRING)(Inner) |
|               <-Reducer 2 [SIMPLE_EDGE] vectorized |
|                 SHUFFLE [RS_40]                    |
|                   PartitionCols:CAST( _col0 AS STRING) |
|                   Group By Operator [GBY_39] (rows=1613465 width=2359) |
|                     Output:["_col0"],keys:KEY._col0 |
|                   <-Map 1 [SIMPLE_EDGE] vectorized |
|                     SHUFFLE [RS_38]                |
|                       PartitionCols:_col0          |
|                       Group By Operator [GBY_37] (rows=3226931 width=2359) |
|                         Output:["_col0"],keys:user_id |
|                         Filter Operator [FIL_36] (rows=3226931 width=2359) |
|                           predicate:user_id is not null |
|                           TableScan [TS_0] (rows=3396769 width=2359) |
|                             
odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] |
|               <-Reducer 6 [SIMPLE_EDGE] vectorized |
|                 SHUFFLE [RS_47]                    |
|                   PartitionCols:CAST( _col0 AS STRING) |
|                   Select Operator [SEL_46] (rows=900308 width=8) |
|                     Output:["_col0"]               |
|                     Filter Operator [FIL_45] (rows=900308 width=572) |
|                       predicate:(row_number_window_0 = 1) |
|                       PTF Operator [PTF_44] (rows=1800617 width=572) |
|                         Function 
definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC NULLS 
LAST","partition by:":"_col2"}] |
|                         Select Operator [SEL_43] (rows=1800617 width=572) |
|                           Output:["_col2","_col4"] |
|                         <-Map 5 [SIMPLE_EDGE] vectorized |
|                           SHUFFLE [RS_42]          |
|                             PartitionCols:user_id  |
|                             Filter Operator [FIL_41] (rows=1800617 width=120) 
|
|                               predicate:((points_balance <> 0) and user_id is 
not null) |
|                               TableScan [TS_7] (rows=1800617 width=120) |
|                                 
odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
 |
|                                                    |
+----------------------------------------------------+

======
将关联表转化为子表，不用group by，结果也是正确的
Convert the associated table to a subtable without using group by, and the 
result is also correct

set hive.auto.convert.join=false;
select
count(*)
from
  (
    select
      user_id
    from
      tmp_dev.test11
  ) a
  JOIN (
        select
      user_id,
      points_balance,
      row_number() over(
        partition by user_id
        order by
          points_balance desc
      ) as rank_no
    from
      odb.test1111  --1800617
    where
      etl_dt = '20240329'
      and points_balance <> 0
  ) c on a.user_id = c.user_id;
  where  c.rank_no = 1
  
+---------+
|   _c0   |
+---------+
| 135417  |
+---------+

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE) |
| Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)        |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 3 vectorized                         |
|       File Output Operator [FS_37]                 |
|         Group By Operator [GBY_36] (rows=1 width=8) |
|           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
|         <-Reducer 2 [CUSTOM_SIMPLE_EDGE]           |
|           PARTITION_ONLY_SHUFFLE [RS_11]           |
|             Group By Operator [GBY_10] (rows=1 width=8) |
|               Output:["_col0"],aggregations:["count()"] |
|               Merge Join Operator [MERGEJOIN_29] (rows=170303 width=8) |
|                 Conds:RS_32._col0=RS_35._col0(Inner) |
|               <-Map 1 [SIMPLE_EDGE] vectorized     |
|                 SHUFFLE [RS_32]                    |
|                   PartitionCols:_col0              |
|                   Select Operator [SEL_31] (rows=170303 width=8) |
|                     Output:["_col0"]               |
|                     Filter Operator [FIL_30] (rows=170303 width=8) |
|                       predicate:user_id is not null |
|                       TableScan [TS_0] (rows=170303 width=8) |
|                         
tmp_dev@test11,test11,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id"] |
|               <-Map 4 [SIMPLE_EDGE] vectorized     |
|                 SHUFFLE [RS_35]                    |
|                   PartitionCols:_col0              |
|                   Select Operator [SEL_34] (rows=1800617 width=8) |
|                     Output:["_col0"]               |
|                     Filter Operator [FIL_33] (rows=1800617 width=120) |
|                       predicate:((points_balance <> 0) and user_id is not 
null) |
|                       TableScan [TS_3] (rows=1800617 width=120) |
|                         
odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
 |
|                                                    |
+----------------------------------------------------+


> Hive on tez: Association of large and small tables satisfies four conditions 
> simultaneously, resulting in data loss: 1. Close mapjoin, 2. Association has 
> group by, 3. Association has windowing function, and 4. Association condition 
> is int/bigint
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-28223
>                 URL: https://issues.apache.org/jira/browse/HIVE-28223
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: zhaojk
>            Priority: Major
>
> When the following four conditions are met during the SQL runtime, the result 
> may be abnormal. The first condition is that there is no mapjoin association. 
> The second condition is that there is a group by in the first paragraph of 
> SQL logic. The third condition is that there is a row number over partition 
> windowing function in the second paragraph of SQL. The fourth condition is 
> that the association condition is of type int or bigint. When these four 
> conditions are met simultaneously, the result will be missing data.
> If one section of SQL is changed to a subtable, the result is correct 
> regardless of whether mapjoin is enabled or not, and converting the 
> association condition to a string result is also correct. Detailed SQL and 
> execution plan
> 大表和小表的SQL运行过程中，满足以下四个条件的时候导致结果异常，第一点没有走mapjoin关联，第二点第一段SQL逻辑中有group 
> by，第三点第二段SQL中有row number over 
> partition的开窗函数，第四点关联条件为int或者bigint类型，同时满足这四个条件的时候结果就会缺失数据。
> 如果将其中一段sql改为子表，不管是否开启mapjoin结果都是正确的，将关联条件转换为string结果也是正确的，详细SQL及执行计划
> hive 3.1.0  TEZ0.9.1
> 关闭mapjoin，结果异常
> Close mapjoin, result is abnormal
> set hive.auto.convert.join=false;
> select
> count(1)
> from
>   (
>     select
>       user_id
>     from
>       odb.test0000   --10524428
>     where
>       etl_dt = '20240329'
>     group by
>       user_id
>   ) a
>   JOIN (
>     select
>       user_id,
>       points_balance,
>       row_number() over(
>         partition by user_id
>         order by
>           points_balance desc
>       ) as rank_no
>     from
>       odb.test1111  --1800617
>     where
>       etl_dt = '20240329'
>       and points_balance <> 0
>   ) c on a.user_id = c.user_id
>   where  c.rank_no = 1;
>   
> +--------+
> |  _c0   |
> +--------+
> | 67594  |
> +--------+
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in root stage                    |
> | Reducer 4 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) |
> | Reducer 5 <- Reducer 4 (CUSTOM_SIMPLE_EDGE)        |
> |                                                    |
> | Stage-0                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-1                                        |
> |       Reducer 5 vectorized                         |
> |       File Output Operator [FS_43]                 |
> |         Group By Operator [GBY_42] (rows=1 width=8) |
> |           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
> |         <-Reducer 4 [CUSTOM_SIMPLE_EDGE]           |
> |           PARTITION_ONLY_SHUFFLE [RS_21]           |
> |             Group By Operator [GBY_20] (rows=1 width=8) |
> |               Output:["_col0"],aggregations:["count()"] |
> |               Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) |
> |                 Conds:GBY_5._col0=SEL_12._col0(Inner) |
> |               <-Group By Operator [GBY_5] (rows=1613465 width=2359) |
> |                   Output:["_col0"],keys:KEY._col0  |
> |               <-Select Operator [SEL_12] (rows=900308 width=8) |
> |                   Output:["_col0"]                 |
> |                   Filter Operator [FIL_28] (rows=900308 width=572) |
> |                     predicate:(row_number_window_0 = 1) |
> |                     PTF Operator [PTF_11] (rows=1800617 width=572) |
> |                       Function 
> definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC 
> NULLS LAST","partition by:":"_col2"}] |
> |                       Select Operator [SEL_10] (rows=1800617 width=572) |
> |                         Output:["_col2","_col4"]   |
> |                       <-Map 1 [SIMPLE_EDGE] vectorized |
> |                         SHUFFLE [RS_39]            |
> |                           PartitionCols:_col0      |
> |                           Group By Operator [GBY_38] (rows=3226931 
> width=2359) |
> |                             Output:["_col0"],keys:user_id |
> |                             Filter Operator [FIL_37] (rows=3226931 
> width=2359) |
> |                               predicate:user_id is not null |
> |                               TableScan [TS_0] (rows=3396769 width=2359) |
> |                                 
> odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] |
> |                       <-Map 3 [SIMPLE_EDGE] vectorized |
> |                         SHUFFLE [RS_41]            |
> |                           PartitionCols:user_id    |
> |                           Filter Operator [FIL_40] (rows=1800617 width=120) 
> |
> |                             predicate ((points_balance <> 0) and user_id is 
> not null) |
> |                             TableScan [TS_7] (rows=1800617 width=120) |
> |                               
> odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
>  |
> |                                                    |
> +----------------------------------------------------+
> 开启mapjoin，结果正确
> Enable mapjoin, the result is correct
> set hive.auto.convert.join=true;
> select
> count(1)
> from
>   (
>     select
>       user_id
>     from
>       odb.test0000   --10524428
>     where
>       etl_dt = '20240329'
>     group by
>       user_id
>   ) a
>   JOIN (
>     select
>       user_id,
>       points_balance,
>       row_number() over(
>         partition by user_id
>         order by
>           points_balance desc
>       ) as rank_no
>     from
>       odb.test1111  --1800617
>     where
>       etl_dt = '20240329'
>       and points_balance <> 0
>   ) c on a.user_id = c.user_id
>   where  c.rank_no = 1;
>   
> +---------+
> |   _c0   |
> +---------+
> | 135417  |
> +---------+
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 5 (BROADCAST_EDGE) |
> | Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)        |
> | Reducer 5 <- Map 4 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-0                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-1                                        |
> |       Reducer 3 vectorized                         |
> |       File Output Operator [FS_52]                 |
> |         Group By Operator [GBY_51] (rows=1 width=8) |
> |           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
> |         <-Reducer 2 [CUSTOM_SIMPLE_EDGE] vectorized |
> |           PARTITION_ONLY_SHUFFLE [RS_50]           |
> |             Group By Operator [GBY_49] (rows=1 width=8) |
> |               Output:["_col0"],aggregations:["count()"] |
> |               Map Join Operator [MAPJOIN_48] (rows=1774811 width=2359) |
> |                 Conds:GBY_47._col0=RS_46._col0(Inner) |
> |               <-Reducer 5 [BROADCAST_EDGE] vectorized |
> |                 BROADCAST [RS_46]                  |
> |                   PartitionCols:_col0              |
> |                   Select Operator [SEL_45] (rows=900308 width=8) |
> |                     Output:["_col0"]               |
> |                     Filter Operator [FIL_44] (rows=900308 width=572) |
> |                       predicate:(row_number_window_0 = 1) |
> |                       PTF Operator [PTF_43] (rows=1800617 width=572) |
> |                         Function 
> definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC 
> NULLS LAST","partition by:":"_col2"}] |
> |                         Select Operator [SEL_42] (rows=1800617 width=572) |
> |                           Output:["_col2","_col4"] |
> |                         <-Map 4 [SIMPLE_EDGE] vectorized |
> |                           SHUFFLE [RS_41]          |
> |                             PartitionCols:user_id  |
> |                             Filter Operator [FIL_40] (rows=1800617 
> width=120) |
> |                               predicate:((points_balance <> 0) and user_id 
> is not null) |
> |                               TableScan [TS_7] (rows=1800617 width=120) |
> |                                 
> odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
>  |
> |               <-Group By Operator [GBY_47] (rows=1613465 width=2359) |
> |                   Output:["_col0"],keys:KEY._col0  |
> |                 <-Map 1 [SIMPLE_EDGE] vectorized   |
> |                   SHUFFLE [RS_39]                  |
> |                     PartitionCols:_col0            |
> |                     Group By Operator [GBY_38] (rows=3226931 width=2359) |
> |                       Output:["_col0"],keys:user_id |
> |                       Filter Operator [FIL_37] (rows=3226931 width=2359) |
> |                         predicate:user_id is not null |
> |                         TableScan [TS_0] (rows=3396769 width=2359) |
> |                           
> odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] |
> |                                                    |
> +----------------------------------------------------+
> --关联条件cast为string，结果正确
> The association condition cast is string, and the result is correct
> set hive.auto.convert.join=false;
> select
> count(1)
> from
>   (
>     select
>       user_id
>     from
>       odb.test0000   --10524428
>     where
>       etl_dt = '20240329'
>     group by
>       user_id
>   ) a
>   JOIN (
>     select
>       user_id,
>       points_balance,
>       row_number() over(
>         partition by user_id
>         order by
>           points_balance desc
>       ) as rank_no
>     from
>       odb.test1111  --1800617
>     where
>       etl_dt = '20240329'
>       and points_balance <> 0
>   ) c on cast(a.user_id as string) = cast(c.user_id as string)
>   where  c.rank_no = 1;
> +---------+
> |   _c0   |
> +---------+
> | 135417  |
> +---------+
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
> | Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE) |
> | Reducer 4 <- Reducer 3 (CUSTOM_SIMPLE_EDGE)        |
> | Reducer 6 <- Map 5 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-0                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-1                                        |
> |       Reducer 4 vectorized                         |
> |       File Output Operator [FS_49]                 |
> |         Group By Operator [GBY_48] (rows=1 width=8) |
> |           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
> |         <-Reducer 3 [CUSTOM_SIMPLE_EDGE]           |
> |           PARTITION_ONLY_SHUFFLE [RS_21]           |
> |             Group By Operator [GBY_20] (rows=1 width=8) |
> |               Output:["_col0"],aggregations:["count()"] |
> |               Merge Join Operator [MERGEJOIN_35] (rows=1774811 width=2359) |
> |                 Conds:RS_40.CAST( _col0 AS STRING)=RS_47.CAST( _col0 AS 
> STRING)(Inner) |
> |               <-Reducer 2 [SIMPLE_EDGE] vectorized |
> |                 SHUFFLE [RS_40]                    |
> |                   PartitionCols:CAST( _col0 AS STRING) |
> |                   Group By Operator [GBY_39] (rows=1613465 width=2359) |
> |                     Output:["_col0"],keys:KEY._col0 |
> |                   <-Map 1 [SIMPLE_EDGE] vectorized |
> |                     SHUFFLE [RS_38]                |
> |                       PartitionCols:_col0          |
> |                       Group By Operator [GBY_37] (rows=3226931 width=2359) |
> |                         Output:["_col0"],keys:user_id |
> |                         Filter Operator [FIL_36] (rows=3226931 width=2359) |
> |                           predicate:user_id is not null |
> |                           TableScan [TS_0] (rows=3396769 width=2359) |
> |                             
> odb@test0000,test0000,Tbl:PARTIAL,Col:NONE,Output:["user_id"] |
> |               <-Reducer 6 [SIMPLE_EDGE] vectorized |
> |                 SHUFFLE [RS_47]                    |
> |                   PartitionCols:CAST( _col0 AS STRING) |
> |                   Select Operator [SEL_46] (rows=900308 width=8) |
> |                     Output:["_col0"]               |
> |                     Filter Operator [FIL_45] (rows=900308 width=572) |
> |                       predicate:(row_number_window_0 = 1) |
> |                       PTF Operator [PTF_44] (rows=1800617 width=572) |
> |                         Function 
> definitions:[{},{"name:":"windowingtablefunction","order by:":"_col4 DESC 
> NULLS LAST","partition by:":"_col2"}] |
> |                         Select Operator [SEL_43] (rows=1800617 width=572) |
> |                           Output:["_col2","_col4"] |
> |                         <-Map 5 [SIMPLE_EDGE] vectorized |
> |                           SHUFFLE [RS_42]          |
> |                             PartitionCols:user_id  |
> |                             Filter Operator [FIL_41] (rows=1800617 
> width=120) |
> |                               predicate:((points_balance <> 0) and user_id 
> is not null) |
> |                               TableScan [TS_7] (rows=1800617 width=120) |
> |                                 
> odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
>  |
> |                                                    |
> +----------------------------------------------------+
> ======
> 将关联表转化为子表，不用group by，结果也是正确的
> Convert the associated table to a subtable without using group by, and the 
> result is also correct
> set hive.auto.convert.join=false;
> select
> count(1)
> from
>   (
>     select
>       user_id
>     from
>       tmp_dev.test11
>   ) a
>   JOIN (
>         select
>       user_id,
>       points_balance,
>       row_number() over(
>         partition by user_id
>         order by
>           points_balance desc
>       ) as rank_no
>     from
>       odb.test1111  --1800617
>     where
>       etl_dt = '20240329'
>       and points_balance <> 0
>   ) c on a.user_id = c.user_id;
>   where  c.rank_no = 1
>   
> +---------+
> |   _c0   |
> +---------+
> | 135417  |
> +---------+
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE) |
> | Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)        |
> |                                                    |
> | Stage-0                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-1                                        |
> |       Reducer 3 vectorized                         |
> |       File Output Operator [FS_37]                 |
> |         Group By Operator [GBY_36] (rows=1 width=8) |
> |           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
> |         <-Reducer 2 [CUSTOM_SIMPLE_EDGE]           |
> |           PARTITION_ONLY_SHUFFLE [RS_11]           |
> |             Group By Operator [GBY_10] (rows=1 width=8) |
> |               Output:["_col0"],aggregations:["count()"] |
> |               Merge Join Operator [MERGEJOIN_29] (rows=170303 width=8) |
> |                 Conds:RS_32._col0=RS_35._col0(Inner) |
> |               <-Map 1 [SIMPLE_EDGE] vectorized     |
> |                 SHUFFLE [RS_32]                    |
> |                   PartitionCols:_col0              |
> |                   Select Operator [SEL_31] (rows=170303 width=8) |
> |                     Output:["_col0"]               |
> |                     Filter Operator [FIL_30] (rows=170303 width=8) |
> |                       predicate:user_id is not null |
> |                       TableScan [TS_0] (rows=170303 width=8) |
> |                         
> tmp_dev@test11,test11,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id"] |
> |               <-Map 4 [SIMPLE_EDGE] vectorized     |
> |                 SHUFFLE [RS_35]                    |
> |                   PartitionCols:_col0              |
> |                   Select Operator [SEL_34] (rows=1800617 width=8) |
> |                     Output:["_col0"]               |
> |                     Filter Operator [FIL_33] (rows=1800617 width=120) |
> |                       predicate((points_balance <> 0) and user_id is not 
> null) |
> |                       TableScan [TS_3] (rows=1800617 width=120) |
> |                         
> odb@test1111,test1111,Tbl:COMPLETE,Col:COMPLETE,Output:["user_id","points_balance"]
>  |
> |                                                    |
> +----------------------------------------------------+



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-28223) Hive on tez: Association of large and small tables satisfies four conditions simultaneously, resulting in data loss: 1. Close mapjoin, 2. Association has group by, 3. Association has windowing function, and 4. Association condition is int/bigint

Reply via email to