TheR1sing3un opened a new issue, #14365:
URL: https://github.com/apache/hudi/issues/14365

   ### Bug Description
   
   **What happened:**
   Recently, we encountered a compaction in which some of the data written to the involved log file did not appear in the new base file.
   After investigation, we found that the log file was not included in the compaction plan as scheduled, even though its completion time was significantly earlier than that of the compaction instant. In our scenario, there is an interval of nearly three days between writing and compaction.
   
   > How we investigated
   
   1. First, we found that the log file was filtered out by the logic here:
   
   <img width="1157" height="778" alt="Image" 
src="https://github.com/user-attachments/assets/670d83e6-2a45-40e3-aa40-138d98c5f5a8";
 />
   
   2. The log was filtered because the code needed to look up the completion time of an instant from three days ago. That instant had already been archived, so we manually queried its completion time on the archived timeline; it was indeed three days ago, as expected.
   3. However, this completion time was not obtained correctly by the compaction code. Instead, null was returned, so the log was treated as having completed after the compaction instant, and thus it was filtered out of the plan.
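
   To make the failure mode concrete, here is a minimal sketch of the decision described above. All names here are illustrative stand-ins, not Hudi's actual API; the point is only how a null completion time falls through the comparison:

   ```java
   public class NullCompletionTimeSketch {
       // Illustrative stand-in for the completion-time lookup described above:
       // it returns null when the instant sits in an archived parquet file that
       // the lazy loader skipped (hypothetical name, not Hudi's API).
       static String getCompletionTime(String logInstantTime) {
           return null; // simulates the incorrect lookup result for the archived instant
       }

       // A log file stays in the compaction plan only when its delta commit is
       // known to have completed before the compaction instant.
       static boolean keepInPlan(String logInstantTime, String compactionInstantTime) {
           String completionTime = getCompletionTime(logInstantTime);
           // A null completion time is effectively treated as "completed after",
           // so the log is silently dropped from the plan.
           return completionTime != null
               && completionTime.compareTo(compactionInstantTime) < 0;
       }

       public static void main(String[] args) {
           // The log committed ~3 days before the compaction instant, yet it is
           // excluded because its completion time resolved to null.
           System.out.println(keepInPlan("20240101000000", "20240104000000")); // prints false
       }
   }
   ```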
   
   > Then why didn't the compaction task retrieve the completion time from the archived timeline?
   
   Let's construct a case:
   - There are 10 instants in total, numbered 1 to 10, and instants 1 to 6 have all been archived.
   - Instants 1 and 2 are archived into the archive parquet file `1_2.parquet`
   - Instants 3 to 6 are archived into the archive parquet file `3_6.parquet`
   
   `archived: [1_2.parquet, 3_6.parquet] ; active: [7-10]`
   
   1. Initialize the `CompletionTimeQueryViewV2`; the cursor is positioned at the first active instant:
   
   <img width="1395" height="294" alt="Image" 
src="https://github.com/user-attachments/assets/af566899-6b3e-4100-a5c3-053a4dd01b35";
 />
   2. Now instants 7 to 10 are stored in memory.
   3. We try to get the completion time of instant 5; this triggers a lazy load of instants starting from 5:
   
   <img width="934" height="318" alt="Image" 
src="https://github.com/user-attachments/assets/67dd4aa0-1a0d-4c99-bf85-b213f91a1054";
 />
   4. In the subsequent scanning and loading logic, the loader scans to the file 3_6.parquet, reads instants 5 and 6 from it, and stores them in memory:
   
   <img width="1307" height="756" alt="Image" 
src="https://github.com/user-attachments/assets/65798860-a2f3-46f6-895c-46fd9b837a15";
 />
   5. Now we try to get the completion time of instant 4; this triggers a lazy load again, this time with the filter [4, 5).
   6. But this time we cannot obtain the correct completion time, because 3_6.parquet is skipped, and instant 4 is exactly in that file:
   
   <img width="1343" height="796" alt="Image" 
src="https://github.com/user-attachments/assets/a39ad201-9f8e-49d7-b6a7-58d3b25bb8f4";
 />
   7. As for why the file is filtered out: the case where the filter's boundaries are entirely contained within the file's min/max range has not been taken into consideration:
   
   <img width="1048" height="230" alt="Image" 
src="https://github.com/user-attachments/assets/c4dab8a0-1aa3-45aa-9bbd-791f02941fca";
 />
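
   The whole sequence above, including the filter issue in step 7, can be sketched as a small standalone simulation. Everything below is an illustrative model with integer instant times and invented names (`ArchiveFile`, `buggyShouldRead`, etc.); the real code compares timestamp strings and lives in Hudi's archived timeline loader:

   ```java
   import java.util.*;

   public class LazyLoadSketch {
       // Illustrative model of an archive parquet file: the inclusive [min, max]
       // range of instants it covers, plus their completion times.
       record ArchiveFile(String name, int min, int max, Map<Integer, String> completionTimes) {}

       final Map<Integer, String> loaded = new HashMap<>(); // instants already in memory
       final List<ArchiveFile> archived;

       LazyLoadSketch(List<ArchiveFile> archived) { this.archived = archived; }

       // Buggy predicate: only keeps a file when one of the file's boundaries
       // falls inside the filter range [start, end). It misses the case where
       // [start, end) lies entirely inside the file's [min, max].
       static boolean buggyShouldRead(ArchiveFile f, int start, int end) {
           return (start <= f.min() && f.min() < end) || (start <= f.max() && f.max() < end);
       }

       // Correct overlap check for [start, end) against the inclusive [min, max].
       static boolean shouldRead(ArchiveFile f, int start, int end) {
           return start <= f.max() && end > f.min();
       }

       // Lazily load instants in [start, end) from the archive.
       void load(int start, int end, boolean buggy) {
           for (ArchiveFile f : archived) {
               boolean read = buggy ? buggyShouldRead(f, start, end) : shouldRead(f, start, end);
               if (read) {
                   f.completionTimes().forEach((i, t) -> {
                       if (i >= start && i < end) loaded.put(i, t);
                   });
               }
           }
       }

       public static void main(String[] args) {
           LazyLoadSketch view = new LazyLoadSketch(List.of(
               new ArchiveFile("1_2.parquet", 1, 2, Map.of(1, "c1", 2, "c2")),
               new ArchiveFile("3_6.parquet", 3, 6, Map.of(3, "c3", 4, "c4", 5, "c5", 6, "c6"))));
           // Steps 3-4: query instant 5 -> lazy load [5, 7) reads 3_6.parquet, caches 5 and 6.
           view.load(5, 7, true);
           // Steps 5-6: query instant 4 -> lazy load with filter [4, 5) skips 3_6.parquet,
           // because [4, 5) is fully contained in the file's [3, 6] range.
           view.load(4, 5, true);
           System.out.println(view.loaded.get(4)); // null -> completion time of instant 4 is lost

           // With the containment-aware overlap check, 3_6.parquet is read and 4 resolves.
           view.load(4, 5, false);
           System.out.println(view.loaded.get(4)); // c4
       }
   }
   ```

   A containment-aware fix would be any standard interval-overlap test such as `start <= max && end > min` above, which also covers the two boundary-inside-file cases the buggy predicate checks.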
   
   **Steps to reproduce:**
   You can reproduce this with the following simple unit test:
   
   <img width="961" height="474" alt="Image" 
src="https://github.com/user-attachments/assets/046fc927-ce73-43a7-834e-f0dd05f8b020";
 />
   
   <img width="1295" height="279" alt="Image" 
src="https://github.com/user-attachments/assets/7d247e12-0554-4ebb-aab7-0ec429de36d5";
 />
   
   ### Environment
   
   **Hudi version:**
   1.x
   **Query engine:** (Spark/Flink/Trino etc)
   
   **Relevant configs:**
   
   
   ### Logs and Stack Trace
   
   _No response_

