Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-11 Thread Lijie Wang
Hi devs, After an offline discussion with the Apache Celeborn folks, we changed the signatures of "snapshotState" and "restoreState" as follows: void snapshotState(CompletableFuture snapshotFuture, ShuffleMasterSnapshotContext context); void restoreState(List snapshots); We believe the above sign
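
For readers skimming the thread, the two signatures quoted above can be pictured with the following minimal Java sketch. It is illustrative only and is not the Flink interface: the generic parameters are not visible in the preview, so the element type ShuffleMasterSnapshot is an assumption, and SnapshotableShuffleMaster, PartitionIdsSnapshot, ToyShuffleMaster and ShuffleSnapshotDemo are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Stand-in types. The element type of the future and the list is an
// assumption; the message preview does not show the generic parameters.
interface ShuffleMasterSnapshot {}

interface ShuffleMasterSnapshotContext {}

// Hypothetical interface holding only the two methods quoted in the message.
interface SnapshotableShuffleMaster {
    void snapshotState(
            CompletableFuture<ShuffleMasterSnapshot> snapshotFuture,
            ShuffleMasterSnapshotContext context);

    void restoreState(List<ShuffleMasterSnapshot> snapshots);
}

// Toy snapshot: an immutable copy of the registered partition ids.
final class PartitionIdsSnapshot implements ShuffleMasterSnapshot {
    final List<String> partitionIds;

    PartitionIdsSnapshot(List<String> ids) {
        this.partitionIds = List.copyOf(ids);
    }
}

// Toy shuffle master whose only internal state is a list of partition ids.
final class ToyShuffleMaster implements SnapshotableShuffleMaster {
    private final List<String> partitionIds = new ArrayList<>();

    void registerPartition(String id) {
        partitionIds.add(id);
    }

    List<String> registeredPartitions() {
        return List.copyOf(partitionIds);
    }

    @Override
    public void snapshotState(
            CompletableFuture<ShuffleMasterSnapshot> snapshotFuture,
            ShuffleMasterSnapshotContext context) {
        // Completing the future hands the snapshot to the caller. This toy
        // completes it synchronously; taking a future instead of returning a
        // value would also allow completing it later from another thread.
        snapshotFuture.complete(new PartitionIdsSnapshot(partitionIds));
    }

    @Override
    public void restoreState(List<ShuffleMasterSnapshot> snapshots) {
        // Rebuild internal state from every snapshot passed in.
        partitionIds.clear();
        for (ShuffleMasterSnapshot snapshot : snapshots) {
            if (snapshot instanceof PartitionIdsSnapshot) {
                partitionIds.addAll(((PartitionIdsSnapshot) snapshot).partitionIds);
            }
        }
    }
}

final class ShuffleSnapshotDemo {
    public static void main(String[] args) {
        ToyShuffleMaster beforeCrash = new ToyShuffleMaster();
        beforeCrash.registerPartition("p-1");
        beforeCrash.registerPartition("p-2");

        CompletableFuture<ShuffleMasterSnapshot> future = new CompletableFuture<>();
        beforeCrash.snapshotState(future, new ShuffleMasterSnapshotContext() {});

        // Simulate a restarted master restoring from the stored snapshot.
        ToyShuffleMaster afterRestart = new ToyShuffleMaster();
        afterRestart.restoreState(List.of(future.join()));
        System.out.println(afterRestart.registeredPartitions());
    }
}
```

In this sketch the snapshot is delivered through the future rather than a return value, and restoreState takes a list, so a restarted master can rebuild its state from more than one stored snapshot.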

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-05 Thread Lijie Wang
Hi Paul, I believe Xintong has answered your question. >> IIUC, in the FLIP, the main method is lost after the recovery, and only submitted jobs would be recovered. Is that right? You are right, we can't recover the execution progress of main method. So after JM crashes, only the submitted and in

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-05 Thread Xintong Song
@Paul, Do you mean the scenario where users call `env.execute()` multiple times in the `main()` method? I believe that is not supported currently when HA is enabled, for the exact same reason you mentioned that Flink is not aware of which jobs are executed and which are not. On the other hand,

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-05 Thread Xintong Song
Thanks for addressing my comments, Lijie. LGTM Best, Xintong On Tue, Dec 5, 2023 at 2:56 PM Paul Lam wrote: > Hi Lijie, > > Recovery for batch jobs is no doubt a long-awaited feature. Thanks for > the proposal! > > I’m concerned about the multi-job scenario. In session mode, users could > us

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-04 Thread Paul Lam
Hi Lijie, Recovery for batch jobs is no doubt a long-awaited feature. Thanks for the proposal! I’m concerned about the multi-job scenario. In session mode, users could use web submission to upload and run jars which may produce multiple Flink jobs. However, these jobs may not be submitted at onc

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-04 Thread Lijie Wang
Thanks for raising this valuable point, Xintong. Supporting external shuffle service makes sense to me. In order to recover the internal states of ShuffleMaster after JM restarts, we will add the following 3 methods to ShuffleMaster: boolean supportsBatchSnapshot(); void snapshotState(Completable

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-03 Thread Xintong Song
Thanks for the proposal, Lijie and Zhu. I have been having offline discussions with the Apache Celeborn folks regarding integrating Apache Celeborn into Flink's Hybrid Shuffle mode. One thing coming from those discussions that might relate to this FLIP is that Celeborn maintains some internal stat

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-11-30 Thread Lijie Wang
Hi Guowei, Thanks for your feedback. >> As far as I know, there are multiple job managers on standby in some scenarios. In this case, is your design still effective? I think it's still effective. There will only be one leader. After becoming the leader, the startup process of JobMaster is the sam

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-11-18 Thread Guowei Ma
Hi, This is a very good proposal, as far as I know, it can solve some very critical production operations in certain scenarios. I have two minor issues: As far as I know, there are multiple job managers on standby in some scenarios. In this case, is your design still effective? I'm unsure if you

[DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-11-02 Thread Lijie Wang
Hi devs, Zhu Zhu and I would like to start a discussion about FLIP-383: Support Job Recovery for Batch Jobs[1] Currently, when Flink’s job manager crashes or gets killed, possibly due to unexpected errors or planned nodes decommission, it will cause the following two situations: 1. Failed, if the