[ https://issues.apache.org/jira/browse/HIVE-26496?focusedWorklogId=808856&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-808856 ]

ASF GitHub Bot logged work on HIVE-26496:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Sep/22 19:34
            Start Date: 14/Sep/22 19:34
    Worklog Time Spent: 10m 
      Work Description: difin commented on code in PR #3559:
URL: https://github.com/apache/hive/pull/3559#discussion_r971208288


##########
ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSplit.java:
##########
@@ -104,13 +104,32 @@ public OrcSplit(Path path, Object fileId, long offset, long length, String[] hos
     this.isOriginal = isOriginal;
     this.hasBase = hasBase;
     this.rootDir = rootDir;
-    this.deltas.addAll(filterDeltasByBucketId(deltas, AcidUtils.parseBucketId(path)));
+    this.deltas.addAll(filterDeleteDeltasByWriteIds
+            (filterDeltasByBucketId(deltas, AcidUtils.parseBucketId(path)), conf));
     this.projColsUncompressedSize = projectedDataSize <= 0 ? length : projectedDataSize;
     // setting file length to Long.MAX_VALUE will let orc reader read file length from file system
     this.fileLen = fileLen <= 0 ? Long.MAX_VALUE : fileLen;
     this.syntheticAcidProps = syntheticAcidProps;
   }
 
+  /**
+   * For every split we want to filter out the delete deltas that contain
+   * events that happened only in the past relative to the split.
+   * @param deltas delete delta metadata, already filtered by bucket id
+   * @param conf the job configuration
+   * @return the delete deltas that are still relevant to this split
+   */
+  protected List<AcidInputFormat.DeltaMetaData> filterDeleteDeltasByWriteIds(
+          List<AcidInputFormat.DeltaMetaData> deltas, Configuration conf) throws IOException {
+
+    AcidOutputFormat.Options orcSplitMinMaxWriteIds =
+            AcidUtils.parseBaseOrDeltaBucketFilename(getPath(), conf);

Review Comment:
   Hi @deniskuzZ,
   Many tests failed after the change from AcidUtils.parseBaseOrDeltaBucketFilename() to AcidUtils.ParsedDeltaLight.parse(), with an exception thrown on this line:
   
   https://github.com/apache/hive/blob/e352684d5c87df1483444afc4c3ee897270bd413/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSplit.java#L108
   
   As I understand it, the split is not always a delta folder; it can be in an older format that ParsedDeltaLight does not support. I saw that ParsedDeltaLight.parse() is used internally in some code paths of AcidUtils.parseBaseOrDeltaBucketFilename(), but not in all of them. Could you please advise whether I should revert to AcidUtils.parseBaseOrDeltaBucketFilename(), which worked in all cases, or whether there is a better approach?
   
   https://github.com/apache/hive/blob/f6bd0eb80767adfa9ce9f47a6d02a4940903effb/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L538-L552
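   
   For reference, a minimal sketch of the filtering I have in mind, assuming AcidOutputFormat.Options exposes the split's write-id range via getMinimumWriteId()/getMaximumWriteId() and AcidInputFormat.DeltaMetaData exposes its own range via getMinWriteId()/getMaxWriteId() (treat these accessor names as assumptions, not verified against master):
   
   ```java
   // Sketch: keep only the delete deltas that could affect rows in this
   // split's write-id range. A delete delta whose newest event precedes the
   // split's minimum write id contains only events in the split's past.
   List<AcidInputFormat.DeltaMetaData> result = new ArrayList<>();
   AcidOutputFormat.Options splitWriteIds =
           AcidUtils.parseBaseOrDeltaBucketFilename(getPath(), conf);
   for (AcidInputFormat.DeltaMetaData delta : deltas) {
     if (delta.getMaxWriteId() >= splitWriteIds.getMinimumWriteId()) {
       result.add(delta);
     }
   }
   return result;
   ```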





Issue Time Tracking
-------------------

    Worklog Id:     (was: 808856)
    Time Spent: 5.5h  (was: 5h 20m)

> FetchOperator scans delete_delta folders multiple times causing slowness
> ------------------------------------------------------------------------
>
>                 Key: HIVE-26496
>                 URL: https://issues.apache.org/jira/browse/HIVE-26496
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Rajesh Balamohan
>            Assignee: Dmitriy Fingerman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> FetchOperator scans far more files/directories than needed.
> For example, here is the layout of a table that went through a set of updates 
> and deletes. A set of "delta" and "delete_delta" folders has been created.
> {noformat}
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/base_0000001
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000002_0000002_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000003_0000003_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000004_0000004_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000005_0000005_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000006_0000006_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000007_0000007_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000008_0000008_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000009_0000009_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000010_0000010_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000011_0000011_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000012_0000012_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000013_0000013_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000014_0000014_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000015_0000015_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000016_0000016_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000017_0000017_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000018_0000018_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000019_0000019_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000020_0000020_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000021_0000021_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000022_0000022_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000002_0000002_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000003_0000003_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000004_0000004_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000005_0000005_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000006_0000006_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000007_0000007_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000008_0000008_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000009_0000009_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000010_0000010_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000011_0000011_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000012_0000012_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000013_0000013_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000014_0000014_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000015_0000015_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000016_0000016_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000017_0000017_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000018_0000018_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000019_0000019_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000020_0000020_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000021_0000021_0000
> {noformat}
>  
> When a user runs *{color:#0747a6}{{select * from date_dim}}{color}* from 
> beeline, FetchOperator computes splits for "date_dim" by scanning the "base" 
> and "delta" folders, producing 21 splits.
> However, for each of the 21 splits, it then loads all of the "delete_delta" 
> folders and scans them unnecessarily. This multiplies the scans by "21 splits 
> * 21 delete_delta folders" (i.e. 441) times, making the statement execution 
> very slow even when the table holds a minimal dataset.
> It would be better to scan only the delete_delta folders relevant to each 
> split, instead of loading all delete_delta folders for every split.
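> As a rough illustration (not the actual patch): each delta directory name 
> encodes its write-id range, e.g. {{delta_0000002_0000002_0000}} covers write 
> ids 2 through 2, so a split only needs the delete_delta folders whose range 
> can still overlap its own:
> {noformat}
> // Illustrative sketch only: parse the max write id from a directory name of
> // the form shown above (e.g. "delete_delta_0000002_0000002_0000") and decide
> // whether it can still matter for a split with the given minimum write id.
> static boolean isRelevantDeleteDelta(String dirName, long splitMinWriteId) {
>   String[] parts = dirName.split("_");
>   long maxWriteId = Long.parseLong(parts[parts.length - 2]);
>   // a delete delta whose newest event predates the split cannot delete its rows
>   return maxWriteId >= splitMinWriteId;
> }
> {noformat}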
>  
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java#L1142
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java#L1172
>  
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java#L402
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
