Hi devs,
After an offline discussion with the Apache Celeborn folks, we changed the
signatures of "snapshotState" and "restoreState" as follows:
void snapshotState(CompletableFuture snapshotFuture,
ShuffleMasterSnapshotContext context);
void restoreState(List snapshots);
We believe the above sign
Hi Paul,
I believe Xintong has answered your question.
>> IIUC, in the FLIP, the main method is lost after the recovery, and only
submitted jobs would be recovered. Is that right?
You are right, we can't recover the execution progress of the main method. So
after the JM crashes, only the submitted and in
@Paul,
Do you mean the scenario where users call `env.execute()` multiple times in
the `main()` method? I believe that is not supported currently when HA is
enabled, for the exact same reason you mentioned that Flink is not aware of
which jobs are executed and which are not.
On the other hand,
Thanks for addressing my comments, Lijie. LGTM
Best,
Xintong
On Tue, Dec 5, 2023 at 2:56 PM Paul Lam wrote:
> Hi Lijie,
>
> Recovery for batch jobs is no doubt a long-awaited feature. Thanks for
> the proposal!
>
> I’m concerned about the multi-job scenario. In session mode, users could
> us
Hi Lijie,
Recovery for batch jobs is no doubt a long-awaited feature. Thanks for
the proposal!
I’m concerned about the multi-job scenario. In session mode, users could
use web submission to upload and run jars which may produce multiple
Flink jobs. However, these jobs may not be submitted at onc
Thanks for raising this valuable point, Xintong.
Supporting external shuffle service makes sense to me. In order to recover
the internal states of ShuffleMaster after JM restarts, we will add the
following 3 methods to ShuffleMaster:
boolean supportsBatchSnapshot();
void snapshotState(Completable
Thanks for the proposal, Lijie and Zhu.
I have been having offline discussions with the Apache Celeborn folks
regarding integrating Apache Celeborn into Flink's Hybrid Shuffle mode. One
thing coming from those discussions that might relate to this FLIP is that
Celeborn maintains some internal stat
Hi Guowei,
Thanks for your feedback.
>> As far as I know, there are multiple job managers on standby in some
scenarios. In this case, is your design still effective?
I think it's still effective. There will only be one leader. After becoming
the leader, the startup process of JobMaster is the sam
Hi,
This is a very good proposal. As far as I know, it can solve some very
critical production operations problems in certain scenarios. I have two minor
questions:
As far as I know, there are multiple job managers on standby in some
scenarios. In this case, is your design still effective? I'm unsure if you
Hi devs,
Zhu Zhu and I would like to start a discussion about FLIP-383: Support Job
Recovery for Batch Jobs[1]
Currently, when Flink’s job manager crashes or gets killed, possibly due to
unexpected errors or planned node decommissioning, it will cause the
following two situations:
1. Failed, if the