zhangyue19921010 opened a new issue #4163:
URL: https://github.com/apache/hudi/issues/4163
**Describe the problem you faced**
If a pending clustering instant remains in the active timeline after several archival actions, then finishing that pending clustering instant later can corrupt the active timeline and lead to incomplete query results.
**To Reproduce**
**Step 1**
Do a normal Hudi insert.
```
drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39
20211130113918979.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39
20211130113918979.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39
20211130113918979.inflight
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39
archived/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39
hoodie.properties
```
**Step 2**
Build a clustering plan but do not execute it. `20211130114103632.replacecommit.requested` will cluster data files from `20211130113918979.commit`.
```
drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39
20211130113918979.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39
20211130113918979.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39
20211130113918979.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41
20211130114103632.replacecommit.requested
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39
archived/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39
hoodie.properties
```
**Step 3**
Do a few more Hudi inserts and trigger several archivals.
```
drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:44 .temp/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39
20211130113918979.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39
20211130113918979.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39
20211130113918979.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41
20211130114103632.replacecommit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:41
20211130114122881.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41
20211130114122881.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41
20211130114122881.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:42
20211130114207164.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42
20211130114207164.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42
20211130114207164.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:44
20211130114351703.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43
20211130114351703.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43
20211130114351703.inflight
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39
archived/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39
hoodie.properties
```
```
drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:23 .temp/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17
20211130114103632.replacecommit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18
20211130131825336.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18
20211130131825336.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18
20211130131825336.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23
20211130132256488.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22
20211130132256488.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22
20211130132256488.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23
20211130132327154.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23
20211130132327154.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23
20211130132327154.inflight
drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23
archived/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17
hoodie.properties
```
`20211130114122881.commit`, `20211130114207164.commit`, and `20211130114351703.commit` were archived.
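The behavior shown in the two listings above can be modeled with a small sketch. This is a simplified illustration of the observed (buggy) archival behavior, not Hudi's actual implementation; the `archive` helper and the retention count are hypothetical, while the instant times and states mirror the listings in Steps 2–3:

```python
# Simplified model of the archival behavior observed above (not Hudi's
# actual code): completed instants beyond the retention limit are archived
# oldest-first, while pending instants are skipped but, in this buggy
# model, do NOT stop archival of the completed instants around them.

def archive(active_timeline, max_commits_retained):
    """Split the timeline into (archived, remaining)."""
    completed = [i for i in active_timeline if i[1] == "completed"]
    pending = [i for i in active_timeline if i[1] != "completed"]
    n_to_archive = max(0, len(completed) - max_commits_retained)
    archived = completed[:n_to_archive]
    remaining = sorted(pending + completed[n_to_archive:])
    return archived, remaining

timeline = [
    ("20211130113918979", "completed"),   # Step 1 insert
    ("20211130114103632", "requested"),   # pending clustering (replacecommit)
    ("20211130114122881", "completed"),
    ("20211130114207164", "completed"),
    ("20211130114351703", "completed"),
    ("20211130131825336", "completed"),
    ("20211130132256488", "completed"),
    ("20211130132327154", "completed"),
]

archived, remaining = archive(timeline, max_commits_retained=3)
# The pending replacecommit survives in the active timeline while the
# completed commits around it (114122881 / 114207164 / 114351703) are
# archived, matching the second listing above.
```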
**Step 4**
Query to check the record count and the underlying Hudi base files.
```
spark.sql("select count(*) from hudi_test").show(10000, false)
=>
+--------+
|count(1)|
+--------+
|4217794 |
+--------+
spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false)
=>
+----------------------------------------------------------------------+
|_hoodie_file_name |
+----------------------------------------------------------------------+
|caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
|12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
|ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
|73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
|a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
|7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
|823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
|00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
|a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
|eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
|b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
|d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
|9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
|4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
+----------------------------------------------------------------------+
```
**Step 5**
Stop inserting and execute the pending clustering replacecommit.
```
drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:27 .temp/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 4736 11 30 13:27
20211130114103632.replacecommit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:27
20211130114103632.replacecommit.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17
20211130114103632.replacecommit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18
20211130131825336.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18
20211130131825336.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18
20211130131825336.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23
20211130132256488.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22
20211130132256488.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22
20211130132256488.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23
20211130132327154.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23
20211130132327154.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23
20211130132327154.inflight
drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23
archived/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17
hoodie.properties
```
**Step 6**
Run the same queries to check the record count and the underlying Hudi base files.
```
spark.sql("select count(*) from hudi_test").show(10000, false)
=>
+--------+
|count(1)|
+--------+
|2410168 |
+--------+
spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false)
=>
+----------------------------------------------------------------------+
|_hoodie_file_name |
+----------------------------------------------------------------------+
|caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
|ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
|73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
|a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
|7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
|00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
|a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
|d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
+----------------------------------------------------------------------+
```
As we can see, the query results before and after clustering differ. In addition, the Step 6 result is missing the records from the base files listed below.
```
|12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
|9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
|823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
|eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
|b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
|4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
```
The root cause of these incomplete query results is that the late-finished clustering instant corrupts the active timeline, so Hudi picks the wrong latest base file here:
https://github.com/apache/hudi/blob/55ecbc662e30068ce0ed49166d254202bd598a8c/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java#L120
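The mechanism can be illustrated with a simplified model of the "is this file slice committed?" check that the linked `HoodieFileGroup` line relies on. This is an assumption-laden sketch, not Hudi's actual code: I assume a slice's base instant counts as committed if it is either present in the active timeline or older than the timeline's first instant (i.e. already archived):

```python
# Simplified model (not Hudi's actual implementation) of the visibility
# check: a file slice is treated as committed if its base instant is in
# the active timeline, or predates the timeline's start (archived).

def is_slice_committed(base_instant_time, active_timeline):
    timeline_start = min(active_timeline)
    return (base_instant_time in active_timeline
            or base_instant_time < timeline_start)

# After Step 5 the completed replacecommit 20211130114103632, with its old
# timestamp, is the oldest instant left in the active timeline:
active_timeline = {
    "20211130114103632",  # late-finished clustering replacecommit
    "20211130131825336",
    "20211130132256488",
    "20211130132327154",
}

# A slice from the early archived commit is before the timeline start,
# so it stays visible:
visible_old = is_slice_committed("20211130113918979", active_timeline)

# Slices from the archived commits 114122881 / 114207164 / 114351703 are
# newer than the timeline start yet absent from the timeline, so they are
# wrongly treated as uncommitted and their records vanish from queries:
visible_mid = is_slice_committed("20211130114122881", active_timeline)
```

Under this model, the files missing from Step 6 are exactly the slices whose base instants fall between the late replacecommit's timestamp and the archival boundary.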
To fix this bug, we need to let pending clustering instants block the archive action, just as pending compaction instants already do.
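The proposed fix can be sketched as follows. This is illustrative only (not the actual Hudi patch); the helper name and retention parameter are hypothetical. The idea is that archival never moves past the earliest pending clustering instant, mirroring the existing pending-compaction guard:

```python
# Sketch of the proposed fix (illustrative, not the actual Hudi patch):
# archival must not cross the earliest pending clustering instant.

def archivable_instants(active_timeline, max_commits_retained):
    """Return the completed instants that may safely be archived."""
    pending_clustering = [t for t, state, action in active_timeline
                          if action == "replacecommit" and state != "completed"]
    limit = min(pending_clustering) if pending_clustering else None

    completed = sorted(t for t, state, _ in active_timeline
                       if state == "completed")
    candidates = completed[:max(0, len(completed) - max_commits_retained)]
    if limit is not None:
        # Keep everything at or after the earliest pending clustering
        # instant in the active timeline until the clustering finishes.
        candidates = [t for t in candidates if t < limit]
    return candidates

timeline = [
    ("20211130113918979", "completed", "commit"),
    ("20211130114103632", "requested", "replacecommit"),  # pending clustering
    ("20211130114122881", "completed", "commit"),
    ("20211130114207164", "completed", "commit"),
    ("20211130114351703", "completed", "commit"),
]

# Even with no retention, only 20211130113918979 is older than the pending
# clustering instant; the later completed commits stay in the active
# timeline, so the visibility check above keeps returning correct results.
safe = archivable_instants(timeline, max_commits_retained=0)
```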
P.S.
Each ingestion will insert 602,542 records.
```
20211130114103632.replacecommit
{
"partitionToWriteStats" : {
"20210623" : [ {
"fileId" : "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0",
"path" :
"20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet",
"prevCommit" : "null",
"numWrites" : 602542,
"numDeletes" : 0,
"numUpdateWrites" : 0,
"numInserts" : 602542,
"totalWriteBytes" : 17645296,
"totalWriteErrors" : 0,
"tempPath" : null,
"partitionPath" : "20210623",
"totalLogRecords" : 0,
"totalLogFilesCompacted" : 0,
"totalLogSizeCompacted" : 0,
"totalUpdatedRecordsCompacted" : 0,
"totalLogBlocks" : 0,
"totalCorruptLogBlock" : 0,
"totalRollbackBlocks" : 0,
"fileSizeInBytes" : 17645296,
"minEventTime" : null,
"maxEventTime" : null
} ]
},
"compacted" : false,
"extraMetadata" : {
"schema" : "xxxxx"
},
"operationType" : "CLUSTER",
"partitionToReplaceFileIds" : {
"20210623" : [ "ac474457-c656-4fff-ac07-7ddd1746f4cf-0",
"a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0" ]
},
"fileIdAndRelativePaths" : {
"9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0" :
"20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet"
},
"totalRecordsDeleted" : 0,
"totalLogRecordsCompacted" : 0,
"totalLogFilesCompacted" : 0,
"totalCompactedRecordsUpdated" : 0,
"totalLogFilesSize" : 0,
"totalScanTime" : 0,
"totalCreateTime" : 11053,
"totalUpsertTime" : 0,
"minAndMaxEventTime" : {
"Optional.empty" : {
"val" : null,
"present" : false
}
},
"writePartitionPaths" : [ "20210623" ]
}
```
**Expected behavior**
Completing the pending clustering instant should not change query results: records from all previously committed instants, including archived ones, should remain visible.
**Environment Description**
* Hudi version : master
* Spark version : 2.4.4
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) :