[jira] [Created] (FLINK-36873) Adapting batch job progress recovery to Apache Celeborn

xuhuang (Jira) Mon, 09 Dec 2024 20:12:06 -0800

xuhuang created FLINK-36873:
-------------------------------

             Summary: Adapting batch job progress recovery to Apache Celeborn
                 Key: FLINK-36873
                 URL: https://issues.apache.org/jira/browse/FLINK-36873
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Network
            Reporter: xuhuang



I've identified several issues while attempting to enable Apache Celeborn to 
support Flink job recovery.
 # *Restore State Invocation*

 ** The method _*{{ShuffleMaster#restoreState}}*_ should be triggered 
regardless of whether the Flink job requires recovery.

 ** This method signifies that a Flink job needs to restore its state, but it 
is currently called only after {_}*{{ShuffleMaster#registerJob}}*{_}.

 ** Consequently, it might not be invoked if the Flink job does not require 
recovery.

 ** For Celeborn, this creates uncertainty regarding when to initialize certain 
components; if the initialization occurs during {*}_{{registerJob}}_{*}, it may 
lack essential information from the stored snapshot, whereas if it takes place 
during {*}_{{restoreState}}_{*}, there is a risk that it may not be invoked at 
all.

 # *JobID Information Requirement* 

 ** Several methods in _*{{ShuffleMaster}}*_ should include _*JobID*_ 
information: {*}_{{ShuffleMaster#supportsBatchSnapshot}}_{*}, 
{_}*{{ShuffleMaster#snapshotState}}*{_}, and 
{_}*{{ShuffleMaster#restoreState}}*{_}.

 ** These methods are intended for job-granularity state storage and 
restoration, but they currently do not incorporate JobID.

 ** Consequently, Celeborn is unable to determine which job triggered these 
calls.

 # *Hybrid Shuffle Support for Batch Job Recovery* 

 ** Hybrid shuffle should support batch job recovery; however, as it stands, 
Flink's hybrid shuffle does not support it.

 ** The Celeborn integration with Flink's hybrid shuffle relies on Flink's 
_*{{NettyShuffleService}}*_. Because that service does not support batch job 
recovery for hybrid shuffle, the Celeborn integration inherits the same limitation.

 # *Cluster Granularity Store/Restore State*

 ** Presently, _*{{ShuffleMaster}}*_ only offers job-granularity interfaces for 
storing and restoring state, as the _*{{NettyShuffleService}}*_ is stateless in 
terms of cluster granularity.

 ** However, _*{{Celeborn#ShuffleMaster}}*_ needs to communicate with the 
Celeborn Master, necessitating the storage of certain cluster-level states, 
such as {_}*{{CelebornAppId}}*{_}.

 ** In my opinion, the cluster-granularity store state interface can be executed 
after {_}*{{ShuffleMaster#start}}*{_}, and _*{{ShuffleMaster#start}}*_ could 
take an additional snapshot parameter to restore the cluster state.
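
To make points 1, 2, and 4 above more concrete, here is a minimal, self-contained Java sketch of what such an interface could look like. Everything in it is hypothetical: {{RecoverableShuffleMaster}}, {{InMemoryShuffleMaster}}, and all method signatures are illustrative only and are not the actual Flink or Celeborn API. Job IDs are plain strings rather than Flink's JobID so that the sketch has no dependencies, and the "app id" merely stands in for cluster-level state such as {{CelebornAppId}}.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.UUID;

/** Hypothetical sketch only; none of these signatures are Flink's or Celeborn's API. */
public class ShuffleMasterSketch {

    /** Assumed shape of a job-aware and cluster-aware shuffle master. */
    interface RecoverableShuffleMaster {
        // Point 4: start() receives the previous cluster-level snapshot, if any.
        void start(Optional<byte[]> clusterSnapshot);
        byte[] snapshotClusterState();

        void registerJob(String jobId);
        // Point 2: every job-granularity hook carries the job ID.
        boolean supportsBatchSnapshot(String jobId);
        byte[] snapshotState(String jobId);
        // Point 1: always called after registerJob; empty means "no recovery needed".
        void restoreState(String jobId, Optional<byte[]> jobSnapshot);
    }

    /** Toy in-memory implementation used only to exercise the call order. */
    static class InMemoryShuffleMaster implements RecoverableShuffleMaster {
        private String appId;
        private final Set<String> registeredJobs = new HashSet<>();
        private final Map<String, String> jobState = new HashMap<>();

        @Override
        public void start(Optional<byte[]> clusterSnapshot) {
            // Reuse the stored app id on recovery, otherwise create a new one.
            appId = clusterSnapshot
                    .map(bytes -> new String(bytes, StandardCharsets.UTF_8))
                    .orElseGet(() -> "app-" + UUID.randomUUID());
        }

        @Override
        public byte[] snapshotClusterState() {
            return appId.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public void registerJob(String jobId) {
            registeredJobs.add(jobId);
        }

        @Override
        public boolean supportsBatchSnapshot(String jobId) {
            return registeredJobs.contains(jobId);
        }

        @Override
        public byte[] snapshotState(String jobId) {
            return jobState.get(jobId).getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public void restoreState(String jobId, Optional<byte[]> jobSnapshot) {
            if (!registeredJobs.contains(jobId)) {
                throw new IllegalStateException("restoreState before registerJob: " + jobId);
            }
            // Single initialization point: restored state if present, fresh state otherwise.
            jobState.put(jobId, jobSnapshot
                    .map(bytes -> new String(bytes, StandardCharsets.UTF_8))
                    .orElse("fresh:" + jobId));
        }

        String appId() { return appId; }
        String stateOf(String jobId) { return jobState.get(jobId); }
    }

    public static void main(String[] args) {
        // First run: no cluster snapshot and no job snapshot.
        InMemoryShuffleMaster first = new InMemoryShuffleMaster();
        first.start(Optional.empty());
        first.registerJob("job-a");
        first.restoreState("job-a", Optional.empty());

        // Simulated recovery: both snapshots are handed back to a new instance.
        InMemoryShuffleMaster second = new InMemoryShuffleMaster();
        second.start(Optional.of(first.snapshotClusterState()));
        second.registerJob("job-a");
        second.restoreState("job-a", Optional.of(first.snapshotState("job-a")));

        System.out.println("app id preserved: " + first.appId().equals(second.appId()));
        System.out.println("job state preserved: "
                + first.stateOf("job-a").equals(second.stateOf("job-a")));
    }
}
```

With this shape, an implementation has one place to initialize per-job components ({{restoreState}}, called with or without a snapshot) and can always tell which job a snapshot belongs to. Whether an optional snapshot argument or a separate callback is the better fit for Flink is left open here.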



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
