[ 
https://issues.apache.org/jira/browse/HIVE-27327?focusedWorklogId=861059&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-861059
 ]

ASF GitHub Bot logged work on HIVE-27327:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/May/23 17:32
            Start Date: 08/May/23 17:32
    Worklog Time Spent: 10m 
      Work Description: aturoczy commented on code in PR #4301:
URL: https://github.com/apache/hive/pull/4301#discussion_r1187698990


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -346,7 +346,16 @@ public Map<String, String> getBasicStatistics(Partish partish) {
              stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
            }
            if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
-              stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+              long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+              if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) &&
+                  summary.containsKey(SnapshotSummary.TOTAL_POS_DELETES_PROP)) {
+                Long actualRecords =
+                    totalRecords - (Long.parseLong(summary.get(SnapshotSummary.TOTAL_EQ_DELETES_PROP)) >

Review Comment:
   It is a bit of a code smell, imho.
   
   With the conditional operator, the reader has to follow a value that passes through two method calls: to read or debug it you need to know what summary.get returns and then what Long.parseLong does with it (the latter is easier, for sure). Maybe it is fully OK and you can choose not to accept this opinion, but I think it is nicer if you create a variable for those values.
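   
   For illustration, a minimal sketch of the extracted-variable shape suggested above. The diff is truncated here, so the ternary condition (only subtracting positional deletes when there are no equality deletes) is an assumption about the patch's intent, not its confirmed logic:
   
   ```java
   long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
   if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) &&
       summary.containsKey(SnapshotSummary.TOTAL_POS_DELETES_PROP)) {
     // Named locals make each parsed counter easy to read and inspect in a debugger.
     long eqDeletes = Long.parseLong(summary.get(SnapshotSummary.TOTAL_EQ_DELETES_PROP));
     long posDeletes = Long.parseLong(summary.get(SnapshotSummary.TOTAL_POS_DELETES_PROP));
     // A positional delete removes exactly one row, so it can be subtracted directly;
     // an equality delete may match any number of rows, so (assumed fallback) the raw
     // total is kept when equality deletes are present.
     long actualRecords = eqDeletes > 0 ? totalRecords : totalRecords - posDeletes;
     stats.put(StatsSetupConst.ROW_COUNT, String.valueOf(actualRecords));
   }
   ```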
   



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -346,7 +346,16 @@ public Map<String, String> getBasicStatistics(Partish partish) {
              stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
            }
            if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
-              stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+              long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+              if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) &&
+                  summary.containsKey(SnapshotSummary.TOTAL_POS_DELETES_PROP)) {
+                Long actualRecords =
+                    totalRecords - (Long.parseLong(summary.get(SnapshotSummary.TOTAL_EQ_DELETES_PROP)) >

Review Comment:
   Btw, why does this getBasicStatistics need a switch case?





Issue Time Tracking
-------------------

    Worklog Id:     (was: 861059)
    Time Spent: 0.5h  (was: 20m)

> Iceberg basic stats: Incorrect row count in snapshot summary leading to 
> unoptimized plans
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-27327
>                 URL: https://issues.apache.org/jira/browse/HIVE-27327
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Simhadri Govindappa
>            Assignee: Simhadri Govindappa
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the absence of equality deletes, the total row count should be:
> {noformat}
> row_count = total-records - total-position-deletes{noformat}
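>  
> For illustration, a hypothetical helper expressing this rule (the name and signature are illustrative, not from the Hive codebase):
> {code:java}
> // A positional delete removes exactly one row, so it can be subtracted
> // directly; an equality delete may match any number of rows, so the
> // summary counters alone cannot give an exact count.
> static long expectedRowCount(long totalRecords, long totalPosDeletes, long totalEqDeletes) {
>   return totalEqDeletes == 0 ? totalRecords - totalPosDeletes : totalRecords;
> }
> {code}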
>  
>  
> Example:
> After many inserts and deletes, there are only 46 records in a table.
> {noformat}
> >>select count(*) from llap_orders;
> +------+
> | _c0  |
> +------+
> | 46   |
> +------+
> 1 row selected (7.22 seconds)
> {noformat}
>  
> But the total-records counter in the snapshot summary indicates that there are 300 records:
>  
> {noformat}
>  {
>     "sequence-number" : 19,
>     "snapshot-id" : 4237525869561629328,
>     "parent-snapshot-id" : 2572487769557272977,
>     "timestamp-ms" : 1683553017982,
>     "summary" : {
>       "operation" : "append",
>       "added-data-files" : "5",
>       "added-records" : "12",
>       "added-files-size" : "3613",
>       "changed-partition-count" : "5",
>       "total-records" : "300",
>       "total-files-size" : "164405",
>       "total-data-files" : "100",
>       "total-delete-files" : "73",
>       "total-position-deletes" : "254",
>       "total-equality-deletes" : "0"
>     }{noformat}
>  
> Applying the formula above: 300 total-records - 254 total-position-deletes = 46, which matches the actual count. Because Hive instead uses the raw total-records value, the plans it generates are unoptimized (note the TableScan estimate of rows=300 below).
> {noformat}
> 0: jdbc:hive2://simhadrigovindappa-2.simhadri> explain update llap_orders set itemid=7 where itemid=5;
> INFO  : OK
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
> | Reducer 3 <- Map 1 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-4                                            |
> |   Stats Work{}                                     |
> |     Stage-0                                        |
> |       Move Operator                                |
> |         table:{"name:":"db.llap_orders"}           |
> |         Stage-3                                    |
> |           Dependency Collection{}                  |
> |             Stage-2                                |
> |               Reducer 2 vectorized                 |
> |               File Output Operator [FS_14]         |
> |                 table:{"name:":"db.llap_orders"}   |
> |                 Select Operator [SEL_13] (rows=150 width=424) |
> |                   Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"] |
> |                 <-Map 1 [SIMPLE_EDGE]              |
> |                   SHUFFLE [RS_4]                   |
> |                     Select Operator [SEL_3] (rows=150 width=424) |
> |                       Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col7","_col8","_col9"] |
> |                       Select Operator [SEL_2] (rows=150 width=644) |
> |                         Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col7","_col8","_col9","_col10","_col11","_col13","_col14","_col15"] |
> |                         Filter Operator [FIL_9] (rows=150 width=220) |
> |                           predicate:(itemid = 5)   |
> |                           TableScan [TS_0] (rows=300 width=220) |
> |                             db@llap_orders,llap_orders,Tbl:COMPLETE,Col:COMPLETE,Output:["orderid","quantity","itemid","tradets","p1","p2"] |
> |               Reducer 3 vectorized                 |
> |               File Output Operator [FS_16]         |
> |                 table:{"name:":"db.llap_orders"}   |
> |                 Select Operator [SEL_15]           |
> |                   Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col4","_col5"] |
> |                 <-Map 1 [SIMPLE_EDGE]              |
> |                   SHUFFLE [RS_10]                  |
> |                     PartitionCols:_col4, _col5     |
> |                     Select Operator [SEL_7] (rows=150 width=220) |
> |                       Output:["_col0","_col1","_col2","_col3","_col4","_col5"] |
> |                        Please refer to the previous Select Operator [SEL_2] |
> |                                                    |
> +----------------------------------------------------+
> 39 rows selected (0.104 seconds)
> 0: jdbc:hive2://simhadrigovindappa-2.simhadri>{noformat}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
