[ https://issues.apache.org/jira/browse/HIVE-27327?focusedWorklogId=861205&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-861205 ]
ASF GitHub Bot logged work on HIVE-27327: ----------------------------------------- Author: ASF GitHub Bot Created on: 09/May/23 14:03 Start Date: 09/May/23 14:03 Worklog Time Spent: 10m Work Description: deniskuzZ commented on code in PR #4301: URL: https://github.com/apache/hive/pull/4301#discussion_r1188653059 ########## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java: ########## @@ -346,7 +346,16 @@ public Map<String, String> getBasicStatistics(Partish partish) { stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP)); } if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) { - stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP)); + long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP)); + if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) && + summary.containsKey(SnapshotSummary.TOTAL_POS_DELETES_PROP)) { + Long actualRecords = Review Comment: count(*) should use stats instead of scanning the whole dataset. To be honest I don't really understand the purpose of 'total-records' if it doesn't reflect an accurate row count. insert 100 rows, delete all, 'total-records'=100 ??? Issue Time Tracking ------------------- Worklog Id: (was: 861205) Time Spent: 1h 40m (was: 1.5h) > Iceberg basic stats: Incorrect row count in snapshot summary leading to > unoptimized plans > ----------------------------------------------------------------------------------------- > > Key: HIVE-27327 > URL: https://issues.apache.org/jira/browse/HIVE-27327 > Project: Hive > Issue Type: Bug > Reporter: Simhadri Govindappa > Assignee: Simhadri Govindappa > Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > In the absence of equality deletes, the total row count should be : > {noformat} > row_count = total-records - total-position-deletes{noformat} > > > Example: > After many inserts and deletes, there are only 46 records in a table. > {noformat} > >>select count(*) from llap_orders; > +------+ > | _c0 | > +------+ > | 46 | > +------+ > 1 row selected (7.22 seconds) > {noformat} > > But the total records in snapshot summary indicate that there are 300 records > > {noformat} > { > "sequence-number" : 19, > "snapshot-id" : 4237525869561629328, > "parent-snapshot-id" : 2572487769557272977, > "timestamp-ms" : 1683553017982, > "summary" : { > "operation" : "append", > "added-data-files" : "5", > "added-records" : "12", > "added-files-size" : "3613", > "changed-partition-count" : "5", > "total-records" : "300", > "total-files-size" : "164405", > "total-data-files" : "100", > "total-delete-files" : "73", > "total-position-deletes" : "254", > "total-equality-deletes" : "0" > }{noformat} > > As a result of this , the hive plans generated are unoptimized. > {noformat} > 0: jdbc:hive2://simhadrigovindappa-2.simhadri> explain update llap_orders set > itemid=7 where itemid=5; > INFO : OK > +----------------------------------------------------+ > | Explain | > +----------------------------------------------------+ > | Vertex dependency in root stage | > | Reducer 2 <- Map 1 (SIMPLE_EDGE) | > | Reducer 3 <- Map 1 (SIMPLE_EDGE) | > | | > | Stage-4 | > | Stats Work{} | > | Stage-0 | > | Move Operator | > | table:{"name:":"db.llap_orders"} | > | Stage-3 | > | Dependency Collection{} | > | Stage-2 | > | Reducer 2 vectorized | > | File Output Operator [FS_14] | > | table:{"name:":"db.llap_orders"} | > | Select Operator [SEL_13] (rows=150 width=424) | > | > Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"] > | > | <-Map 1 [SIMPLE_EDGE] | > | SHUFFLE [RS_4] | > | Select Operator [SEL_3] (rows=150 width=424) | > | > Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col7","_col8","_col9"] > | > | Select Operator [SEL_2] (rows=150 width=644) | > | > Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col7","_col8","_col9","_col10","_col11","_col13","_col14","_col15"] > | > | Filter Operator [FIL_9] (rows=150 width=220) | > | predicate:(itemid = 5) | > | TableScan [TS_0] (rows=300 width=220) | > | > db@llap_orders,llap_orders,Tbl:COMPLETE,Col:COMPLETE,Output:["orderid","quantity","itemid","tradets","p1","p2"] > | > | Reducer 3 vectorized | > | File Output Operator [FS_16] | > | table:{"name:":"db.llap_orders"} | > | Select Operator [SEL_15] | > | > Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col4","_col5"] | > | <-Map 1 [SIMPLE_EDGE] | > | SHUFFLE [RS_10] | > | PartitionCols:_col4, _col5 | > | Select Operator [SEL_7] (rows=150 width=220) | > | > Output:["_col0","_col1","_col2","_col3","_col4","_col5"] | > | Please refer to the previous Select Operator [SEL_2] > | > | | > +----------------------------------------------------+ > 39 rows selected (0.104 seconds) > 0: jdbc:hive2://simhadrigovindappa-2.simhadri>{noformat} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)