codope commented on issue #13002: URL: https://github.com/apache/hudi/issues/13002#issuecomment-2754893887
@gbcoder2020 What we see in the logs:

```
ExecutorLostFailure (executor 1493 exited caused by one of the running tasks) Reason: Executor heartbeat timed out ...
```

Spark tasks that were performing metadata compaction died repeatedly, leading Spark to abort the stage. Do you think the cluster was behaving erratically at the time?

As to why the job later recovered with no config changes, I can only guess. When Hudi sees a stale or unfinished `.inflight` commit (e.g. from March 11th), the next write rolls back that incomplete commit or, as in this case, re-attempts the compaction. That's why you see many rollbacks in the timeline between 11th and 16th March.

The issue may or may not be related to the log file markers added in 0.15.0 - https://github.com/apache/hudi/commit/c2c7e0538f8cf3031781ebdd776d1c03bfec3bb3. Since the table has recovered, all the previous markers would be lost. It would have been helpful to take a backup of `.hoodie` when the issue happened.

Nevertheless, the marker mechanism and the heartbeat mechanism in Hudi are related, and reconciliation of markers is attempted in the post-commit phase. Heartbeats (with a timeout) tell whether the ongoing commit instant (and hence the writer) is alive or not. Heartbeats say: "Yes, I (the writer) am still active, don’t treat me as hung or stale." Marker files say: "These are the specific files I (the writer) plan to create/modify in this commit." Marker files are deleted once the commit succeeds.

When a commit's heartbeat times out, Hudi will eventually roll back that commit. As part of the rollback, it cleans up the marker files and any partially written data files for that commit. So although marker files and heartbeats have separate jobs, they come together when dealing with a failed commit.

Hope this gives some clarity.
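To make the interplay between heartbeats and markers concrete, here is a minimal toy model in Python. It is only a sketch of the idea described above, not Hudi's actual implementation: `ToyCommit`, `is_expired`, `rollback`, `roll_back_expired`, the instant labels, the file paths, and the timeout value are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCommit:
    """Hypothetical stand-in for an in-flight commit (not a Hudi class)."""
    instant: str
    last_heartbeat: float                      # time of the writer's last heartbeat, in seconds
    markers: set = field(default_factory=set)  # files the writer said it would create/modify
    completed: bool = False


def is_expired(commit, now, timeout_s):
    """A commit looks dead if it never completed and its heartbeat is too old."""
    return not commit.completed and (now - commit.last_heartbeat) > timeout_s


def rollback(commit, storage):
    """Delete whatever the markers point at, then drop the markers themselves."""
    for path in commit.markers:
        storage.discard(path)   # the file may be partially written or may not exist at all
    commit.markers.clear()


def roll_back_expired(commits, storage, now, timeout_s):
    """What the next write would do in this toy model: roll back every expired commit."""
    rolled_back = []
    for instant in list(commits):
        if is_expired(commits[instant], now, timeout_s):
            rollback(commits[instant], storage)
            del commits[instant]
            rolled_back.append(instant)
    return rolled_back


if __name__ == "__main__":
    # Assumed toy state: one writer died mid-commit, another is still alive.
    storage = {"p1/base.parquet", "p1/partial-a.parquet", "p1/partial-b.parquet"}
    commits = {
        "t1": ToyCommit("t1", last_heartbeat=0.0,
                        markers={"p1/partial-a.parquet", "p1/partial-b.parquet"}),
        "t2": ToyCommit("t2", last_heartbeat=95.0, markers={"p1/new.parquet"}),
    }
    print(roll_back_expired(commits, storage, now=100.0, timeout_s=30.0))
    print(sorted(storage))
```

In this model the heartbeat only answers "is the writer still alive?", while the markers answer "what needs cleaning up if it is not?". The rollback step needs both, which is the point where the two mechanisms meet.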
