codope commented on issue #13002:
URL: https://github.com/apache/hudi/issues/13002#issuecomment-2754893887

   @gbcoder2020 What we see in the logs:
   
   ```
   ExecutorLostFailure (executor 1493 exited caused by one of the running 
tasks) Reason: Executor heartbeat timed out ...
   ```
   Spark tasks that were performing metadata compaction died repeatedly, 
leading Spark to abort the stage. Do you think the cluster was behaving 
erratically at the time?
   
   As to why the job later recovered with no config changes, I can only 
guess. When Hudi sees a stale or unfinished `.inflight` commit (e.g. from March 
11th), the next write will roll back that incomplete commit or, as in this 
case, re-attempt the compaction. That's why you see many rollbacks in the 
timeline between March 11th and 16th.
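
   To make the "stale or unfinished instant" idea concrete, here is a rough 
sketch that scans a list of timeline file names and reports instants that only 
have `.requested`/`.inflight` files. It assumes the 0.x naming convention 
(`<instant>.<action>[.requested|.inflight]`); the function name 
`pending_instants` and the sample file names are made up for illustration and 
are not part of Hudi.

```python
def pending_instants(filenames):
    """Return instant times that have no completed timeline file.

    Assumption: a file whose last dot-separated part is 'requested' or
    'inflight' is pending; any other '<instant>.<action>' file is completed.
    """
    pending, completed = set(), set()
    for name in filenames:
        parts = name.split(".")
        instant = parts[0]
        # Skip non-timeline entries such as hoodie.properties or .aux
        if len(parts) < 2 or not instant.isdigit():
            continue
        if parts[-1] in ("requested", "inflight"):
            pending.add(instant)
        else:
            completed.add(instant)
    return sorted(pending - completed)


# Hypothetical listing of a .hoodie directory
listing = [
    "20250310090000000.deltacommit.requested",
    "20250310090000000.deltacommit.inflight",
    "20250310090000000.deltacommit",
    "20250311101500000.deltacommit.requested",
    "20250311101500000.deltacommit.inflight",
    "hoodie.properties",
]
print(pending_instants(listing))
```

   Any instant this reports is a candidate for rollback (or a compaction 
re-attempt) by the next writer, which would match the rollbacks seen in the 
timeline.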
   
   The issue may or may not be related to the log file markers added in 0.15.0 - 
https://github.com/apache/hudi/commit/c2c7e0538f8cf3031781ebdd776d1c03bfec3bb3. 
Since the table has recovered, all the previous markers will have been lost. It 
would have been helpful to take a backup of `.hoodie` when the issue happened. 
   
   Nevertheless, the marker and heartbeat mechanisms in Hudi are related, and 
reconciliation of markers is attempted in the post-commit phase. 
Heartbeats (with a timeout) tell whether the ongoing commit instant (and hence 
the writer) is alive or not. Heartbeats say: "Yes, I (the writer) am still 
active, don’t treat me as hung or stale." Marker files say: "These are the 
specific files I (the writer) plan to create/modify in this commit." Marker 
files are deleted once the commit succeeds. When a commit's 
heartbeat times out, Hudi will eventually roll back that commit. As part of the 
rollback, it cleans up the marker files and any partially written data files 
for that commit. So although marker files and heartbeats have separate jobs, 
they come together when dealing with a failed commit. Hope this gives some 
clarity. 
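
   The interaction described above can be sketched with a toy model. This is 
not Hudi's API: `ToyCommit`, `rollback_if_expired`, the file paths and the 
60-second timeout are all hypothetical, chosen only to show how a timed-out 
heartbeat triggers a rollback that uses the markers to find what to clean up.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCommit:
    """Hypothetical stand-in for one ongoing commit instant."""
    instant: str
    last_heartbeat: float                       # time of the writer's last heartbeat
    markers: set = field(default_factory=set)   # files the writer plans to create/modify


def is_expired(commit, now, timeout_s):
    """The writer is considered dead once its heartbeat is older than the timeout."""
    return now - commit.last_heartbeat > timeout_s


def rollback_if_expired(commit, data_files, now, timeout_s):
    """Roll back the commit if its heartbeat has timed out.

    Removes every data file named by a marker, then deletes the markers.
    Returns True if a rollback happened.
    """
    if not is_expired(commit, now, timeout_s):
        return False
    for path in commit.markers:
        data_files.discard(path)   # partially written file, if it exists
    commit.markers.clear()
    return True


# Writer planned two files but died after writing only one of them.
commit = ToyCommit("20250311101500000", last_heartbeat=0.0,
                   markers={"p1/f1.parquet", "p1/f2.parquet"})
data_files = {"p1/f0.parquet", "p1/f1.parquet"}

still_alive = rollback_if_expired(commit, data_files, now=30.0, timeout_s=60.0)
rolled_back = rollback_if_expired(commit, data_files, now=100.0, timeout_s=60.0)
print(still_alive, rolled_back, sorted(data_files))
```

   In this sketch the heartbeat only decides when to act and the markers only 
decide what to clean, which is the division of labour described above.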


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.