Rajesh Balamohan created HIVE-27099:
---------------------------------------

             Summary: Iceberg: select count(*) from table queries all data
                 Key: HIVE-27099
                 URL: https://issues.apache.org/jira/browse/HIVE-27099
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan


select count is scanning all data. Though it has complete basic stats, it 
launched tez job which wasn't needed. Second issue is, it ended up scanning 
ENTIRE 148 GB dataset which is completely not required. It should have got the 
data from parq files itself. Ideal situation is getting entire records from 
manifest itself.

Data is stored in parquet format in external tables. This may be broken for 
parquet, as for ORC it is able to read less data (footer info). 

1. Consider fixing count( * ) for parq
2. Check if it is possible to read stats from iceberg manifests after #1.


{noformat}

explain select count(*) from store_sales;

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5
      Edges:
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
      DagName: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  Statistics: Num rows: 2879966589 Data size: 195666988943 
Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    Statistics: Num rows: 2879966589 Data size: 195666988943 
Basic stats: COMPLETE Column stats: COMPLETE
                    Group By Operator
                      aggregations: count()
                      minReductionHashAggr: 0.5
                      mode: hash
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        null sort order:
                        sort order:
                        Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: COMPLETE
                        value expressions: _col0 (type: bigint)
            Execution mode: vectorized
        Reducer 2
            Execution mode: vectorized
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

53 rows selected (1.454 seconds)

0: jdbc:hive2://ve0:218> select count(*) from store_sales;
INFO  : Query ID = hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Subscribed to counters: [] for queryId: 
hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8
INFO  : Session is already open
INFO  : Dag name: select count(*) from store_sales (Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id 
application_1676286357243_0061)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  
FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED    767        767        0        0    
   0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0    
   0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 54.94 s
----------------------------------------------------------------------------------------------
INFO  : Status: DAG finished successfully in 54.85 seconds
INFO  :
INFO  : Query Execution Summary
INFO  : 
----------------------------------------------------------------------------------------------
INFO  : OPERATION                            DURATION
INFO  : 
----------------------------------------------------------------------------------------------
INFO  : Compile Query                           1.42s
INFO  : Prepare Plan                            0.18s
INFO  : Get Query Coordinator (AM)              0.01s
INFO  : Submit Plan                             0.37s
INFO  : Start DAG                               0.08s
INFO  : Run DAG                                54.84s
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :
INFO  : Task Execution Summary
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   
INPUT_RECORDS   OUTPUT_RECORDS
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :      Map 1          41326.00      4,280,860         28,890   
2,879,966,589            1,332
INFO  :  Reducer 2          26123.00          4,750             34             
767                0
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :
INFO  : org.apache.tez.common.counters.DAGCounter:
INFO  :    NUM_SUCCEEDED_TASKS: 768
INFO  :    TOTAL_LAUNCHED_TASKS: 768
INFO  :    DATA_LOCAL_TASKS: 677
INFO  :    RACK_LOCAL_TASKS: 90
INFO  :    AM_CPU_MILLISECONDS: 101240
INFO  :    AM_GC_TIME_MILLIS: 156
INFO  : File System Counters:
INFO  :    FILE_BYTES_READ: 2820
INFO  :    FILE_BYTES_WRITTEN: 46020
INFO  :    HDFS_BYTES_WRITTEN: 110
INFO  :    HDFS_READ_OPS: 2
INFO  :    HDFS_WRITE_OPS: 2
INFO  :    HDFS_OP_CREATE: 1
INFO  :    HDFS_OP_GET_FILE_STATUS: 2
INFO  :    HDFS_OP_RENAME: 1
INFO  :    OFS_BYTES_READ: 148270746936
INFO  :    OFS_READ_OPS: 11058
...

{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to