[ https://issues.apache.org/jira/browse/HIVE-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560881#comment-17560881 ]
Stamatis Zampetakis commented on HIVE-26365:
--------------------------------------------

What happens if the MERGE statement has only INSERT branches? In that case collecting stats seems to make sense and the stats could potentially be exploited too (an insert-only MERGE is sketched after the plan below).

Remove column statistics collection task from merge statement plan
-------------------------------------------------------------------

                Key: HIVE-26365
                URL: https://issues.apache.org/jira/browse/HIVE-26365
            Project: Hive
         Issue Type: Sub-task
           Reporter: Krisztian Kasa
           Assignee: Krisztian Kasa
           Priority: Major
             Labels: pull-request-available
            Fix For: 4.0.0

         Time Spent: 10m
 Remaining Estimate: 0h

Merge statements may contain delete and update branches. An update is technically a delete followed by an insert. Column statistics such as min and max cannot be recalculated from the changed records when rows are deleted. Hive currently marks the column statistics of the target table invalid after an Update/Delete/Merge, yet for merge statements extra GBY operators and reducers are still generated for the insert branches to calculate column statistics, and the Stats Work tasks collect column statistics as well.

{code}
POSTHOOK: query: explain
merge into acidTbl_n0 as t using nonAcidOrcTbl_n0 s ON t.a = s.a
WHEN MATCHED AND s.a > 8 THEN DELETE
WHEN MATCHED THEN UPDATE SET b = 7
WHEN NOT MATCHED THEN INSERT VALUES(s.a, s.b)
POSTHOOK: type: QUERY
POSTHOOK: Input: default@acidtbl_n0
POSTHOOK: Input: default@nonacidorctbl_n0
POSTHOOK: Output: default@acidtbl_n0
POSTHOOK: Output: default@acidtbl_n0
POSTHOOK: Output: default@merge_tmp_table
STAGE DEPENDENCIES:
  Stage-5 is a root stage
  Stage-6 depends on stages: Stage-5
  Stage-0 depends on stages: Stage-6
  Stage-7 depends on stages: Stage-0
  Stage-1 depends on stages: Stage-6
  Stage-8 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-6
  Stage-9 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-6
  Stage-10 depends on stages: Stage-3
  Stage-4 depends on stages: Stage-6
  Stage-11 depends on stages: Stage-4

STAGE PLANS:
  Stage: Stage-5
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 10 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
        Reducer 4 <- Reducer 2 (SIMPLE_EDGE)
        Reducer 5 <- Reducer 2 (SIMPLE_EDGE)
        Reducer 6 <- Reducer 5 (CUSTOM_SIMPLE_EDGE)
        Reducer 7 <- Reducer 2 (SIMPLE_EDGE)
        Reducer 8 <- Reducer 7 (CUSTOM_SIMPLE_EDGE)
        Reducer 9 <- Reducer 2 (SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: s
                  Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: a (type: int), b (type: int)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      key expressions: _col0 (type: int)
                      null sort order: z
                      sort order: +
                      Map-reduce partition columns: _col0 (type: int)
                      Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                      value expressions: _col1 (type: int)
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Map 10
            Map Operator Tree:
                TableScan
                  alias: t
                  filterExpr: a is not null (type: boolean)
                  Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: a is not null (type: boolean)
                    Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: a (type: int), ROW__ID (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 2 Data size: 160 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        null sort order: z
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 2 Data size: 160 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
            Execution mode: vectorized, llap
            LLAP IO: may be used (ACID table)
        Reducer 2
            Execution mode: llap
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Left Outer Join 0 to 1
                keys:
                  0 _col0 (type: int)
                  1 _col0 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 6 Data size: 288 Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: _col3 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>), _col1 (type: int), _col2 (type: int), _col0 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3
                  Statistics: Num rows: 6 Data size: 288 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((_col2 = _col3) and (_col3 > 8)) (type: boolean)
                    Statistics: Num rows: 1 Data size: 88 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                        null sort order: z
                        sort order: +
                        Map-reduce partition columns: UDFToInteger(_col0) (type: int)
                        Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((_col2 = _col3) and (_col3 <= 8)) (type: boolean)
                    Statistics: Num rows: 2 Data size: 176 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                      outputColumnNames: _col0
                      Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                        null sort order: z
                        sort order: +
                        Map-reduce partition columns: UDFToInteger(_col0) (type: int)
                        Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((_col2 = _col3) and (_col3 <= 8)) (type: boolean)
                    Statistics: Num rows: 2 Data size: 176 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: _col2 (type: int), 7 (type: int)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        null sort order: a
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: int)
                  Filter Operator
                    predicate: _col2 is null (type: boolean)
                    Statistics: Num rows: 4 Data size: 192 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: _col3 (type: int), _col1 (type: int)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        null sort order: a
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: int)
                  Filter Operator
                    predicate: (_col2 = _col3) (type: boolean)
                    Statistics: Num rows: 3 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                      outputColumnNames: _col0
                      Statistics: Num rows: 3 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
                      Group By Operator
                        aggregations: count()
                        keys: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                        minReductionHashAggr: 0.4
                        mode: hash
                        outputColumnNames: _col0, _col1
                        Statistics: Num rows: 2 Data size: 168 Basic stats: COMPLETE Column stats: COMPLETE
                        Reduce Output Operator
                          key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                          null sort order: z
                          sort order: +
                          Map-reduce partition columns: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                          Statistics: Num rows: 2 Data size: 168 Basic stats: COMPLETE Column stats: COMPLETE
                          value expressions: _col1 (type: bigint)
        Reducer 3
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                      serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                      name: default.acidtbl_n0
                  Write Type: DELETE
        Reducer 4
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                outputColumnNames: _col0
                Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                      serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                      name: default.acidtbl_n0
                  Write Type: DELETE
        Reducer 5
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: int)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                      serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                      name: default.acidtbl_n0
                  Write Type: INSERT
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int)
                  outputColumnNames: a, b
                  Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
                  Group By Operator
                    aggregations: min(a), max(a), count(1), count(a), compute_bit_vector_hll(a), min(b), max(b), count(b), compute_bit_vector_hll(b)
                    minReductionHashAggr: 0.5
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
                    Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      null sort order:
                      sort order:
                      Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
                      value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: binary), _col5 (type: int), _col6 (type: int), _col7 (type: bigint), _col8 (type: binary)
        Reducer 6
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: min(VALUE._col0), max(VALUE._col1), count(VALUE._col2), count(VALUE._col3), compute_bit_vector_hll(VALUE._col4), min(VALUE._col5), max(VALUE._col6), count(VALUE._col7), compute_bit_vector_hll(VALUE._col8)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
                Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: 'LONG' (type: string), UDFToLong(_col0) (type: bigint), UDFToLong(_col1) (type: bigint), (_col2 - _col3) (type: bigint), COALESCE(ndv_compute_bit_vector(_col4),0) (type: bigint), _col4 (type: binary), 'LONG' (type: string), UDFToLong(_col5) (type: bigint), UDFToLong(_col6) (type: bigint), (_col2 - _col7) (type: bigint), COALESCE(ndv_compute_bit_vector(_col8),0) (type: bigint), _col8 (type: binary)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                  Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Reducer 7
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: int)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                      serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                      name: default.acidtbl_n0
                  Write Type: INSERT
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int)
                  outputColumnNames: a, b
                  Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                  Group By Operator
                    aggregations: min(a), max(a), count(1), count(a), compute_bit_vector_hll(a), min(b), max(b), count(b), compute_bit_vector_hll(b)
                    minReductionHashAggr: 0.75
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
                    Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      null sort order:
                      sort order:
                      Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
                      value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: binary), _col5 (type: int), _col6 (type: int), _col7 (type: bigint), _col8 (type: binary)
        Reducer 8
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: min(VALUE._col0), max(VALUE._col1), count(VALUE._col2), count(VALUE._col3), compute_bit_vector_hll(VALUE._col4), min(VALUE._col5), max(VALUE._col6), count(VALUE._col7), compute_bit_vector_hll(VALUE._col8)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
                Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: 'LONG' (type: string), UDFToLong(_col0) (type: bigint), UDFToLong(_col1) (type: bigint), (_col2 - _col3) (type: bigint), COALESCE(ndv_compute_bit_vector(_col4),0) (type: bigint), _col4 (type: binary), 'LONG' (type: string), UDFToLong(_col5) (type: bigint), UDFToLong(_col6) (type: bigint), (_col2 - _col7) (type: bigint), COALESCE(ndv_compute_bit_vector(_col8),0) (type: bigint), _col8 (type: binary)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                  Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Reducer 9
            Execution mode: llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 2 Data size: 168 Basic stats: COMPLETE Column stats: COMPLETE
                Filter Operator
                  predicate: (_col1 > 1L) (type: boolean)
                  Statistics: Num rows: 1 Data size: 84 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: cardinality_violation(_col0) (type: int)
                    outputColumnNames: _col0
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          name: default.merge_tmp_table

  Stage: Stage-6
    Dependency Collection

  Stage: Stage-0
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.acidtbl_n0
          Write Type: DELETE

  Stage: Stage-7
    Stats Work
      Basic Stats Work:

  Stage: Stage-1
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.acidtbl_n0
          Write Type: DELETE

  Stage: Stage-8
    Stats Work
      Basic Stats Work:

  Stage: Stage-2
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.acidtbl_n0
          Write Type: INSERT

  Stage: Stage-9
    Stats Work
      Basic Stats Work:

  Stage: Stage-3
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.acidtbl_n0
          Write Type: INSERT

  Stage: Stage-10
    Stats Work
      Basic Stats Work:
      Column Stats Desc:
          Columns: a, b
          Column Types: int, int
          Table: default.acidtbl_n0

  Stage: Stage-4
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.merge_tmp_table

  Stage: Stage-11
    Stats Work
      Basic Stats Work:
{code}

One of the insert Reducers and the follow-up Reducer that collects the column stats:

{code}
        Reducer 5
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: int)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                      serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                      name: default.acidtbl_n0
                  Write Type: INSERT
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int)
                  outputColumnNames: a, b
                  Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
                  Group By Operator
                    aggregations: min(a), max(a), count(1), count(a), compute_bit_vector_hll(a), min(b), max(b), count(b), compute_bit_vector_hll(b)
                    minReductionHashAggr: 0.5
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
                    Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      null sort order:
                      sort order:
                      Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
                      value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: binary), _col5 (type: int), _col6 (type: int), _col7 (type: bigint), _col8 (type: binary)
        Reducer 6
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: min(VALUE._col0), max(VALUE._col1), count(VALUE._col2), count(VALUE._col3), compute_bit_vector_hll(VALUE._col4), min(VALUE._col5), max(VALUE._col6), count(VALUE._col7), compute_bit_vector_hll(VALUE._col8)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
                Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: 'LONG' (type: string), UDFToLong(_col0) (type: bigint), UDFToLong(_col1) (type: bigint), (_col2 - _col3) (type: bigint), COALESCE(ndv_compute_bit_vector(_col4),0) (type: bigint), _col4 (type: binary), 'LONG' (type: string), UDFToLong(_col5) (type: bigint), UDFToLong(_col6) (type: bigint), (_col2 - _col7) (type: bigint), COALESCE(ndv_compute_bit_vector(_col8),0) (type: bigint), _col8 (type: binary)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                  Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
{code}
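For illustration, an insert-only MERGE of the kind raised in the comment would look roughly like the statement below. This is only a sketch that reuses the table names from the plan above; it is not part of the ticket. Since no branch deletes rows, the min/max/NDV computed over the inserted rows could in principle still be merged into the existing column statistics rather than discarded.

{code}
-- Hypothetical insert-only MERGE: a single WHEN NOT MATCHED branch,
-- no UPDATE or DELETE branches, against the same tables as the plan above.
EXPLAIN
MERGE INTO acidTbl_n0 AS t
USING nonAcidOrcTbl_n0 s
ON t.a = s.a
WHEN NOT MATCHED THEN INSERT VALUES (s.a, s.b);
{code}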