zhangyue19921010 opened a new issue #4163:
URL: https://github.com/apache/hudi/issues/4163


   **Describe the problem you faced**
   If there's a pending clustering instant still existed in active timeline 
after several archival actions.
   Next time we finish this pending clustering instant, this clustering instant 
may stain the ActiveTimeLine and lead to incomplete query results
   
   **To Reproduce**
   
   
   
   **Step 1** 
   Do a normal hudi insert 
   ```
   drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
   drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 .temp/
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 
20211130113918979.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
20211130113918979.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
20211130113918979.inflight
   drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 
archived/
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 
hoodie.properties
   ```
   
   **Step 2** 
   Build a clustering plan but don't execute this plan
   20211130114103632.replacecommit.requested will cluster data files from 
20211130113918979.commit
   
   ```
   drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
   drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 .temp/
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 
20211130113918979.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
20211130113918979.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
20211130113918979.inflight
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 11:41 
20211130114103632.replacecommit.requested
   drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 
archived/
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 
hoodie.properties
   ```
   
   **Step 3** 
   Do a few times hudi insert and trigger several archivals
   
   ```
   drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
   drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:44 .temp/
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 
20211130113918979.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
20211130113918979.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
20211130113918979.inflight
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 11:41 
20211130114103632.replacecommit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:41 
20211130114122881.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:41 
20211130114122881.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:41 
20211130114122881.inflight
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:42 
20211130114207164.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:42 
20211130114207164.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:42 
20211130114207164.inflight
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:44 
20211130114351703.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:43 
20211130114351703.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:43 
20211130114351703.inflight
   drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 
archived/
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 
hoodie.properties
   ```
   
   ```
   drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 13:17 .aux/
   drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 13:23 .temp/
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 13:17 
20211130114103632.replacecommit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:18 
20211130131825336.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 
20211130131825336.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 
20211130131825336.inflight
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 
20211130132256488.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 
20211130132256488.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 
20211130132256488.inflight
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 
20211130132327154.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 
20211130132327154.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 
20211130132327154.inflight
   drwxr-xr-x   6 yuezhang  FREEWHEELMEDIA\Domain Users   192 11 30 13:23 
archived/
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 13:17 
hoodie.properties
   ```
   
   20211130114122881.commit 20211130114207164.commit and 
20211130114351703.commit were archived.
   
   **Step 4** 
   Do query to check record numbers and based hudi data files.
   ```
   val frame = spark.sql("select count(*) from hudi_test").show(10000, false)
   => 
   +--------+
   |count(1)|
   +--------+
   |4217794 |
   +--------+
   
   val frame = spark.sql("select distinct(_hoodie_file_name) from 
hudi_test").show(10000, false)
   =>
   +----------------------------------------------------------------------+
   |_hoodie_file_name                                                     |
   +----------------------------------------------------------------------+
   |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
   |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
   |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
   |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
   |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
   |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
   |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
   |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
   |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
   |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
   |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
   |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
   |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
   |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
   +----------------------------------------------------------------------+
   ```
   
   **Step 5** 
   Stop insert and trigger that pending clustering replace request
   
   ```
   drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 13:17 .aux/
   drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 13:27 .temp/
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  4736 11 30 13:27 
20211130114103632.replacecommit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:27 
20211130114103632.replacecommit.inflight
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 13:17 
20211130114103632.replacecommit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:18 
20211130131825336.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 
20211130131825336.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 
20211130131825336.inflight
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 
20211130132256488.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 
20211130132256488.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 
20211130132256488.inflight
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 
20211130132327154.commit
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 
20211130132327154.commit.requested
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 
20211130132327154.inflight
   drwxr-xr-x   6 yuezhang  FREEWHEELMEDIA\Domain Users   192 11 30 13:23 
archived/
   -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 13:17 
hoodie.properties
   ```
   
   **Step 6** 
   Do the same queries to check record numbers and based hudi data files.
   
   ```
   val frame = spark.sql("select count(*) from hudi_test").show(10000, false)
   => 
   +--------+
   |count(1)|
   +--------+
   |2410168 |
   +--------+
   
   val frame = spark.sql("select distinct(_hoodie_file_name) from 
hudi_test").show(10000, false)
   =>
   +----------------------------------------------------------------------+
   |_hoodie_file_name                                                     |
   +----------------------------------------------------------------------+
   |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
   |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
   |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
   |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
   |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
   |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
   |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
   |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
   +----------------------------------------------------------------------+
   ```
   
   
   As we can see, we get different query result compared with before-clustering 
and after-clustering.
   Also query result from Step 6 is missing records from these base file 
mentioned below.
   
   ```
   |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
   |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
   
   |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
   |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
   
   |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
   |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
   ```
   
   
   The root cause of this incomplete query results is that late finished 
clustering instant stain this activeTimeLine hoodie get wrong latest base file 
according to 
   
https://github.com/apache/hudi/blob/55ecbc662e30068ce0ed49166d254202bd598a8c/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java#L120
   
   To fix this bug, we need let pending clustering instant to block archive 
action like pending compaction did.
   
   P.S.
   Each ingestion will insert 602,542 records.
   ```
   20211130114103632.replacecommit
   {
     "partitionToWriteStats" : {
       "20210623" : [ {
         "fileId" : "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0",
         "path" : 
"20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet",
         "prevCommit" : "null",
         "numWrites" : 602542,
         "numDeletes" : 0,
         "numUpdateWrites" : 0,
         "numInserts" : 602542,
         "totalWriteBytes" : 17645296,
         "totalWriteErrors" : 0,
         "tempPath" : null,
         "partitionPath" : "20210623",
         "totalLogRecords" : 0,
         "totalLogFilesCompacted" : 0,
         "totalLogSizeCompacted" : 0,
         "totalUpdatedRecordsCompacted" : 0,
         "totalLogBlocks" : 0,
         "totalCorruptLogBlock" : 0,
         "totalRollbackBlocks" : 0,
         "fileSizeInBytes" : 17645296,
         "minEventTime" : null,
         "maxEventTime" : null
       } ]
     },
     "compacted" : false,
     "extraMetadata" : {
       "schema" : "xxxxx"
     },
     "operationType" : "CLUSTER",
     "partitionToReplaceFileIds" : {
       "20210623" : [ "ac474457-c656-4fff-ac07-7ddd1746f4cf-0", 
"a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0" ]
     },
     "fileIdAndRelativePaths" : {
       "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0" : 
"20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet"
     },
     "totalRecordsDeleted" : 0,
     "totalLogRecordsCompacted" : 0,
     "totalLogFilesCompacted" : 0,
     "totalCompactedRecordsUpdated" : 0,
     "totalLogFilesSize" : 0,
     "totalScanTime" : 0,
     "totalCreateTime" : 11053,
     "totalUpsertTime" : 0,
     "minAndMaxEventTime" : {
       "Optional.empty" : {
         "val" : null,
         "present" : false
       }
     },
     "writePartitionPaths" : [ "20210623" ]
   }
   ```
   
   
   
   
   
   
   
   
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : master
   
   * Spark version : 2.4.4
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : 
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to