[ https://issues.apache.org/jira/browse/HIVE-26496?focusedWorklogId=808986&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-808986 ]
ASF GitHub Bot logged work on HIVE-26496:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Sep/22 07:38
            Start Date: 15/Sep/22 07:38
    Worklog Time Spent: 10m
      Work Description: deniskuzZ commented on code in PR #3559:
URL: https://github.com/apache/hive/pull/3559#discussion_r971635989


##########
ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSplit.java:
##########
@@ -104,28 +103,44 @@ public OrcSplit(Path path, Object fileId, long offset, long length, String[] hos
     this.isOriginal = isOriginal;
     this.hasBase = hasBase;
     this.rootDir = rootDir;
-    this.deltas.addAll(filterDeltasByBucketId(deltas, AcidUtils.parseBucketId(path)));
+    int bucketId = AcidUtils.parseBucketId(path);
+    long minWriteId = !deltas.isEmpty() ?
+            AcidUtils.parseBaseOrDeltaBucketFilename(path, null).getMinimumWriteId() : -1;
+    this.deltas.addAll(
+            deltas.stream()
+            .filter(delta -> isQualifiedDeleteDeltasByWriteIds(delta, minWriteId))

Review Comment:
   maybe simply
   ````
   .filter(delta -> delta.getMaxWriteId() >= minWriteId)
   ````
   for better readability, so there is no need to go inside the method to check the logic?


Issue Time Tracking
-------------------

    Worklog Id:     (was: 808986)
    Time Spent: 7h  (was: 6h 50m)

> FetchOperator scans delete_delta folders multiple times causing slowness
> ------------------------------------------------------------------------
>
>                 Key: HIVE-26496
>                 URL: https://issues.apache.org/jira/browse/HIVE-26496
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Rajesh Balamohan
>            Assignee: Dmitriy Fingerman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 7h
>  Remaining Estimate: 0h
>
> FetchOperator scans far more files and directories than needed.
> For example, here is the layout of a table that had a set of updates and
> deletes. A set of "delta" and "delete_delta" folders was created.
> {noformat}
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/base_0000001
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000002_0000002_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000003_0000003_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000004_0000004_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000005_0000005_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000006_0000006_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000007_0000007_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000008_0000008_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000009_0000009_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000010_0000010_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000011_0000011_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000012_0000012_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000013_0000013_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000014_0000014_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000015_0000015_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000016_0000016_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000017_0000017_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000018_0000018_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000019_0000019_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000020_0000020_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000021_0000021_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_0000022_0000022_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000002_0000002_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000003_0000003_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000004_0000004_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000005_0000005_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000006_0000006_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000007_0000007_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000008_0000008_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000009_0000009_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000010_0000010_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000011_0000011_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000012_0000012_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000013_0000013_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000014_0000014_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000015_0000015_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000016_0000016_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000017_0000017_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000018_0000018_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000019_0000019_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000020_0000020_0000
> s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_0000021_0000021_0000
> {noformat}
>
> When a user runs *{color:#0747a6}{{select * from date_dim}}{color}* from
> beeline, FetchOperator tries to compute splits in "date_dim". It scans the
> "base" and "delta" folders and computes 21 splits.
> However, for each of the 21 splits, it ends up loading all of the
> "delete_delta" folders and scanning them unnecessarily. This multiplies the
> scan to "21 splits * 21 delete_delta folders" (i.e. 441) folder scans. This
> makes statement execution very slow, even when the table contains only a
> minimal dataset.
> It would be better to scan only the relevant delete_delta folders for each
> split, instead of loading all delete_delta folders in every split.
>
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java#L1142|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java#L1142]
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java#L1172|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java#L1172]
>
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java#L402|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java#L402]
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
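
For illustration only, here is a minimal standalone sketch of the write ID filter suggested in the review comment, applied to the folder layout from the issue description. The class DeleteDeltaSketch, the nested DeleteDelta type and the method relevantDeleteDeltas are hypothetical stand-ins, not Hive classes; only the predicate delta.getMaxWriteId() >= minWriteId comes from the comment. The sketch assumes a delete_delta whose maximum write ID is below a split's minimum write ID cannot affect that split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch; none of these types are part of Hive.
final class DeleteDeltaSketch {

  /** Minimal model of a delete_delta folder: only its write ID range. */
  static final class DeleteDelta {
    private final long minWriteId;
    private final long maxWriteId;

    DeleteDelta(long minWriteId, long maxWriteId) {
      this.minWriteId = minWriteId;
      this.maxWriteId = maxWriteId;
    }

    long getMinWriteId() { return minWriteId; }
    long getMaxWriteId() { return maxWriteId; }
  }

  /** Keeps only delete deltas whose max write ID is at or above the split's min write ID. */
  static List<DeleteDelta> relevantDeleteDeltas(List<DeleteDelta> deltas, long splitMinWriteId) {
    return deltas.stream()
        .filter(delta -> delta.getMaxWriteId() >= splitMinWriteId)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // delete_delta_0000002_0000002 .. delete_delta_0000022_0000022 from the listing.
    List<DeleteDelta> all = new ArrayList<>();
    for (long id = 2; id <= 22; id++) {
      all.add(new DeleteDelta(id, id));
    }

    // A split read from delta_0000021_0000021 has minimum write ID 21,
    // so only the delete deltas with write IDs 21 and 22 remain.
    System.out.println(relevantDeleteDeltas(all, 21).size()); // prints 2

    // A split read from delta_0000002_0000002 still needs all 21 delete deltas.
    System.out.println(relevantDeleteDeltas(all, 2).size()); // prints 21
  }
}
```

With this predicate the number of delete_delta folders attached to a split shrinks as the split's write ID grows, instead of staying at 21 for every split.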