Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-11 Thread Lijie Wang

Hi devs, After an offline discussion with the Apache Celeborn folks, we changed the signatures of "snapshotState" and "restoreState" as follows: void snapshotState(CompletableFuture snapshotFuture, ShuffleMasterSnapshotContext context); void restoreState(List snapshots); We believe the above sign

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-05 Thread Lijie Wang

Hi Paul, I believe Xintong has answered your question. >> IIUC, in the FLIP, the main method is lost after the recovery, and only submitted jobs would be recovered. Is that right? You are right, we can't recover the execution progress of the main method. So after the JM crashes, only the submitted and in

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-05 Thread Xintong Song

@Paul, Do you mean the scenario where users call `env.execute()` multiple times in the `main()` method? I believe that is not currently supported when HA is enabled, for the exact reason you mentioned: Flink is not aware of which jobs have been executed and which have not. On the other hand,

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-05 Thread Xintong Song

Thanks for addressing my comments, Lijie. LGTM Best, Xintong On Tue, Dec 5, 2023 at 2:56 PM Paul Lam wrote: > Hi Lijie, > > Recovery for batch jobs is no doubt a long-awaited feature. Thanks for > the proposal! > > I’m concerned about the multi-job scenario. In session mode, users could > us

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-04 Thread Paul Lam

Hi Lijie, Recovery for batch jobs is no doubt a long-awaited feature. Thanks for the proposal! I’m concerned about the multi-job scenario. In session mode, users could use web submission to upload and run jars which may produce multiple Flink jobs. However, these jobs may not be submitted at onc

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-04 Thread Lijie Wang

Thanks for raising this valuable point, Xintong. Supporting an external shuffle service makes sense to me. In order to recover the internal states of ShuffleMaster after the JM restarts, we will add the following 3 methods to ShuffleMaster: boolean supportsBatchSnapshot(); void snapshotState(Completable

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-12-03 Thread Xintong Song

Thanks for the proposal, Lijie and Zhu. I have been having offline discussions with the Apache Celeborn folks regarding integrating Apache Celeborn into Flink's Hybrid Shuffle mode. One thing coming from those discussions that might relate to this FLIP is that Celeborn maintains some internal stat

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-11-30 Thread Lijie Wang

Hi Guowei, Thanks for your feedback. >> As far as I know, there are multiple job managers on standby in some scenarios. In this case, is your design still effective? I think it's still effective. There will only be one leader. After becoming the leader, the startup process of JobMaster is the sam

Re: [DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-11-18 Thread Guowei Ma

Hi, This is a very good proposal. As far as I know, it can solve some very critical production operations in certain scenarios. I have two minor issues: As far as I know, there are multiple job managers on standby in some scenarios. In this case, is your design still effective? I'm unsure if you

[DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

2023-11-02 Thread Lijie Wang

Hi devs, Zhu Zhu and I would like to start a discussion about FLIP-383: Support Job Recovery for Batch Jobs[1]. Currently, when Flink’s job manager crashes or gets killed, possibly due to unexpected errors or planned node decommissioning, it will cause the following two situations: 1. Failed, if the
