Rajesh Balamohan created HIVE-27099: ---------------------------------------
Summary: Iceberg: select count(*) from table queries all data Key: HIVE-27099 URL: https://issues.apache.org/jira/browse/HIVE-27099 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan select count is scanning all data. Though it has complete basic stats, it launched tez job which wasn't needed. Second issue is, it ended up scanning ENTIRE 148 GB dataset which is completely not required. It should have got the data from parq files itself. Ideal situation is getting entire records from manifest itself. Data is stored in parquet format in external tables. This may be broken for parquet, as for ORC it is able to read less data (footer info). 1. Consider fixing count( * ) for parq 2. Check if it is possible to read stats from iceberg manifests after #1. {noformat} explain select count(*) from store_sales; Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5 Edges: Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) DagName: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales Statistics: Num rows: 2879966589 Data size: 195666988943 Basic stats: COMPLETE Column stats: COMPLETE Select Operator Statistics: Num rows: 2879966589 Data size: 195666988943 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: count() minReductionHashAggr: 0.5 mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: bigint) Execution mode: vectorized Reducer 2 Execution mode: vectorized Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink 53 rows selected (1.454 seconds) 0: jdbc:hive2://ve0:218> select count(*) from store_sales; INFO : Query ID = hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8 INFO : Total jobs = 1 INFO : Launching Job 1 out of 1 INFO : Starting task [Stage-1:MAPRED] in serial mode INFO : Subscribed to counters: [] for queryId: hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8 INFO : Session is already open INFO : Dag name: select count(*) from store_sales (Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1676286357243_0061) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 767 767 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 54.94 s ---------------------------------------------------------------------------------------------- INFO : Status: DAG finished successfully in 54.85 seconds INFO : INFO : Query Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : OPERATION DURATION INFO : ---------------------------------------------------------------------------------------------- INFO : Compile Query 1.42s INFO : Prepare Plan 0.18s INFO : Get Query Coordinator (AM) 0.01s INFO : Submit Plan 0.37s INFO : Start DAG 0.08s INFO : Run DAG 54.84s INFO : ---------------------------------------------------------------------------------------------- INFO : INFO : Task Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS INFO : ---------------------------------------------------------------------------------------------- INFO : Map 1 41326.00 4,280,860 28,890 2,879,966,589 1,332 INFO : Reducer 2 26123.00 4,750 34 767 0 INFO : ---------------------------------------------------------------------------------------------- INFO : INFO : org.apache.tez.common.counters.DAGCounter: INFO : NUM_SUCCEEDED_TASKS: 768 INFO : TOTAL_LAUNCHED_TASKS: 768 INFO : DATA_LOCAL_TASKS: 677 INFO : RACK_LOCAL_TASKS: 90 INFO : AM_CPU_MILLISECONDS: 101240 INFO : AM_GC_TIME_MILLIS: 156 INFO : File System Counters: INFO : FILE_BYTES_READ: 2820 INFO : FILE_BYTES_WRITTEN: 46020 INFO : HDFS_BYTES_WRITTEN: 110 INFO : HDFS_READ_OPS: 2 INFO : HDFS_WRITE_OPS: 2 INFO : HDFS_OP_CREATE: 1 INFO : HDFS_OP_GET_FILE_STATUS: 2 INFO : HDFS_OP_RENAME: 1 INFO : OFS_BYTES_READ: 148270746936 INFO : OFS_READ_OPS: 11058 ... {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)