[ https://issues.apache.org/jira/browse/FLINK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-36873: ----------------------------------- Labels: pull-request-available (was: ) > Adapting batch job progress recovery to Apache Celeborn > ------------------------------------------------------- > > Key: FLINK-36873 > URL: https://issues.apache.org/jira/browse/FLINK-36873 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network > Reporter: xuhuang > Assignee: Junrui Lee > Priority: Major > Labels: pull-request-available > > I've identified several issues while attempting to enable Apache Celeborn to > support Flink batch job recovery. > *1. RestoreState Invocation* > * The method _*{{ShuffleMaster#restoreState}}*_ should be triggered > regardless of whether the Flink job requires recovery. > * This method signifies that a Flink job needs to restore its state, but it > is currently called only after {_}*{{ShuffleMaster#registerJob}}*{_}. > * Consequently, it might not be invoked if the Flink job does not require > recovery. > * For Celeborn, this creates uncertainty regarding when to initialize > certain components; if the initialization occurs during > {*}_{{registerJob}}_{*}, it may lack essential information from the stored > snapshot, whereas if it takes place during {*}_{{restoreState}}_{*}, there is > a risk that it may not be invoked at all. > *2. JobID Information Requirement* > * Several methods in _*{{ShuffleMaster}}*_ should include _*JobID*_ > information: {*}_{{ShuffleMaster#supportsBatchSnapshot}}_{*}, > {_}*{{ShuffleMaster#snapshotState}}*{_}, and > {_}*{{ShuffleMaster#restoreState}}*{_}. > * These methods are intended for job-granularity state storage and > restoration, but they currently do not incorporate JobID. > * Consequently, Celeborn is unable to determine which job triggered these > calls. > {*}3. Cluster granularity store/restore state{*}: > * Presently, _*{{ShuffleMaster}}*_ only offers job-granularity interfaces > for storing and restoring state, as the _*{{NettyShuffleService}}*_ is > stateless in terms of cluster granularity. > * However, _*{{Celeborn#ShuffleMaster}}*_ needs to communicate with the > Celeborn Master, necessitating the storage of certain cluster-level states, > such as {_}*{{CelebornAppId}}*{_}. > * In my opinion, the cluster-granularity store state interface can be > execute after {_}*{{ShuffleMaster#start}}*{_}, and > _*{{ShuffleMaster#start}}*_ adding a snapshot parameter to restore the > cluster state. -- This message was sent by Atlassian Jira (v8.20.10#820010)