[ https://issues.apache.org/jira/browse/HIVE-26496?focusedWorklogId=810384&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-810384 ]

ASF GitHub Bot logged work on HIVE-26496:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Sep/22 13:38
            Start Date: 20/Sep/22 13:38
    Worklog Time Spent: 10m 
      Work Description: difin commented on code in PR #3606:
URL: https://github.com/apache/hive/pull/3606#discussion_r975377018


##########
ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSplit.java:
##########
@@ -103,7 +103,7 @@ public OrcSplit(Path path, Object fileId, long offset, long length, String[] hos
     this.isOriginal = isOriginal;
     this.hasBase = hasBase;
     this.rootDir = rootDir;
-    int bucketId = AcidUtils.parseBucketId(path);
+    bucketId = AcidUtils.parseBucketId(path);

Review Comment:
   I didn't find an explanation in the history of why the bucket id was set in 
the parse method. Regarding making bucketId final: bucketId is a private class 
member and cannot be set from outside OrcSplit. Before my change it was set 
using the getPath() method from the parent class; that method uses the 'fs' 
class member of the parent class, which is package-private and inaccessible 
from OrcSplit. So bucketId is derived from the path, which cannot be modified, 
and it can therefore only ever hold one value. In my opinion it is safe to make 
bucketId final: whether it is set in the constructor or in the parse method, it 
can only be assigned the same unique value, which is based on the path, and the 
path can't change at runtime.
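
For reference, a minimal sketch of the pattern under discussion (this is not the actual OrcSplit source; the class name and the simplified parseBucketId body are illustrative stand-ins for Hive's AcidUtils.parseBucketId):

{code:java}
import org.apache.hadoop.fs.Path;

public class OrcSplitSketch {
    // Derived solely from the immutable path, so it can safely be final.
    // The removed line in the diff ("int bucketId = ...") declared a *local*
    // variable that shadowed this field, leaving the field unassigned.
    private final int bucketId;

    public OrcSplitSketch(Path path /* other constructor args elided */) {
        this.bucketId = parseBucketId(path);
    }

    public int getBucketId() {
        return bucketId;
    }

    // Simplified, hypothetical stand-in: real ACID bucket files have names
    // like "bucket_00000" (possibly with extra suffixes not handled here).
    private static int parseBucketId(Path path) {
        String name = path.getName();
        return name.startsWith("bucket_")
                ? Integer.parseInt(name.substring("bucket_".length()))
                : -1;
    }
}
{code}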





Issue Time Tracking
-------------------

    Worklog Id:     (was: 810384)
    Time Spent: 9.5h  (was: 9h 20m)

> FetchOperator scans delete_delta folders multiple times causing slowness
> ------------------------------------------------------------------------
>
>                 Key: HIVE-26496
>                 URL: https://issues.apache.org/jira/browse/HIVE-26496
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Rajesh Balamohan
>            Assignee: Dmitriy Fingerman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> FetchOperator scans far more files/directories than needed.
> For example, here is the layout of a table that received a series of updates 
> and deletes; a set of "delta" and "delete_delta" folders was created for them.
> {noformat}
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/base_0000001
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000002_0000002_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000003_0000003_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000004_0000004_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000005_0000005_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000006_0000006_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000007_0000007_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000008_0000008_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000009_0000009_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000010_0000010_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000011_0000011_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000012_0000012_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000013_0000013_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000014_0000014_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000015_0000015_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000016_0000016_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000017_0000017_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000018_0000018_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000019_0000019_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000020_0000020_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000021_0000021_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000022_0000022_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000002_0000002_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000003_0000003_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000004_0000004_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000005_0000005_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000006_0000006_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000007_0000007_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000008_0000008_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000009_0000009_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000010_0000010_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000011_0000011_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000012_0000012_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000013_0000013_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000014_0000014_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000015_0000015_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000016_0000016_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000017_0000017_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000018_0000018_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000019_0000019_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000020_0000020_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000021_0000021_0000
> {noformat}
>  
> When a user runs *{color:#0747a6}{{select * from date_dim}}{color}* from 
> beeline, FetchOperator computes splits for "date_dim": it scans the "base" 
> and "delta" folders and computes 21 splits.
> However, for each of the 21 splits, it then loads and scans all 21 
> "delete_delta" folders unnecessarily. This multiplies the scan work to "21 
> splits * 21 delete_delta folders" (i.e. 441) directory reads, which makes the 
> statement execution extremely slow, even with a minimal dataset in the table.
> It would be good to scan only the delete_delta folders relevant to each 
> split, instead of loading all delete_delta folders in every split (see the 
> illustrative sketch after this quoted description).
>  
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java#L1142
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java#L1172
>  
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java#L402
>  
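
To make the intended fix concrete, here is a hypothetical sketch of pruning delete deltas per split by the split's bucket id, so each split no longer touches all 21 delete_delta folders. The class and method names and the exact pruning rule are illustrative assumptions, not Hive's actual API (though the PR's diff above does add a bucketId field to OrcSplit):

{code:java}
import java.util.ArrayList;
import java.util.List;

final class DeleteDeltaPruning {

    /** Illustrative stand-in for a parsed delete_delta directory. */
    static final class DeleteDelta {
        final int bucketId;

        DeleteDelta(int bucketId) {
            this.bucketId = bucketId;
        }
    }

    /**
     * Keep only the delete deltas whose bucket matches the split's bucket,
     * instead of reading every delete_delta directory once per split
     * (21 splits x 21 directories = 441 reads in the example above).
     */
    static List<DeleteDelta> relevantFor(int splitBucketId, List<DeleteDelta> all) {
        List<DeleteDelta> out = new ArrayList<>();
        for (DeleteDelta d : all) {
            if (d.bucketId == splitBucketId) {
                out.add(d);
            }
        }
        return out;
    }
}
{code}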



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
