Re: [PR] [HUDI-9570] Flink AsyncInstant may lost data when trigger recommit [hudi]

via GitHub Mon, 14 Jul 2025 23:48:38 -0700


cshuo commented on PR #13530:
URL: https://github.com/apache/hudi/pull/13530#issuecomment-3072241488


   > > One concern is whether we should avoid introducing new RPC calls as much 
as possible, since @danny0405 also mentioned there is actually a chance that 
sending the RPC event may fail.
   > > Can we commit the residual write metadata events in 
`handleBootstrapEvent` after cleaning legacy events in the restart case 
(attemptId > 0)? I think it can also solve the problem that happened in step 2.
   > > > Step 2: The task automatically restarted and relaunched. It began 
writing data based on Commit 2. Since the attempt count was > 0, no recommit 
operation was performed.
   > 
   > Agreed. However, in this scenario, if a transmission failure occurs, the 
pending instant still won't be successfully re-committed, and thus will remain 
in the state. Therefore, this shouldn't pose significant issues.
   
   Yes, the fix in this PR also works well even when an RPC failure occurs, 
but it makes the writing/committing flow more complicated. 
   
   After discussing with @danny0405 offline, we came up with another 
[fix](https://github.com/apache/hudi/pull/13543) on the writer coordinator 
side. It relies on the coordinator's checkpoint to persist the uncommitted 
write metadata events and recommits them when the job is restored. It would be 
great if you could help verify it. 
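
   To make the coordinator-side idea concrete, here is a minimal, 
self-contained sketch. It is not the code in the linked PR: `CoordinatorSketch`, 
`MetaEvent`, and every method name below are hypothetical stand-ins, Java 
serialization is used only for brevity, and the restore path recommits 
unconditionally, whereas a real implementation would presumably first check 
what is already committed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names are Hudi or Flink APIs.
class CoordinatorSketch {

  /** Stand-in for a write metadata event sent by a write task. */
  static final class MetaEvent implements Serializable {
    private static final long serialVersionUID = 1L;
    final String instant;
    final int taskId;

    MetaEvent(String instant, int taskId) {
      this.instant = instant;
      this.taskId = taskId;
    }
  }

  // Events received from write tasks whose instant is not committed yet.
  private final List<MetaEvent> uncommitted = new ArrayList<>();
  // Instants this coordinator has committed, in commit order.
  private final List<String> committedInstants = new ArrayList<>();

  void handleWriteMetaEvent(MetaEvent event) {
    uncommitted.add(event);
  }

  /** Snapshot the uncommitted events into the coordinator checkpoint. */
  byte[] checkpoint() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(new ArrayList<>(uncommitted));
    }
    return bytes.toByteArray();
  }

  /** Reload the checkpointed events and recommit their instants. */
  void restore(byte[] state) throws IOException, ClassNotFoundException {
    uncommitted.clear();
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(state))) {
      @SuppressWarnings("unchecked")
      List<MetaEvent> restored = (List<MetaEvent>) in.readObject();
      uncommitted.addAll(restored);
    }
    // Collect distinct instants first so we do not mutate the list mid-iteration.
    Set<String> pending = new LinkedHashSet<>();
    for (MetaEvent event : uncommitted) {
      pending.add(event.instant);
    }
    for (String instant : pending) {
      commit(instant);
    }
  }

  /** Mark an instant committed and drop its buffered events. */
  void commit(String instant) {
    uncommitted.removeIf(event -> event.instant.equals(instant));
    if (!committedInstants.contains(instant)) {
      committedInstants.add(instant);
    }
  }

  List<String> committedInstants() {
    return new ArrayList<>(committedInstants);
  }

  int pendingCount() {
    return uncommitted.size();
  }
}
```

   The point of the sketch is that no extra task-to-coordinator RPC is needed 
on restart: the events needed for the recommit travel with the coordinator's 
own checkpoint state.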
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
