xuhuang created FLINK-36873:
-------------------------------
Summary: Adapting batch job progress recovery to Apache Celeborn
Key: FLINK-36873
URL: https://issues.apache.org/jira/browse/FLINK-36873
Project: Flink
Issue Type: Improvement
Components: Runtime / Network
Reporter: xuhuang
I've identified several issues while attempting to enable Apache Celeborn to
support Flink job recovery.
# *Restore State Invocation*
** The method _*{{ShuffleMaster#restoreState}}*_ should be triggered
regardless of whether the Flink job requires recovery.
** This method signifies that a Flink job needs to restore its state, but it
is currently called only after _*{{ShuffleMaster#registerJob}}*_.
** Consequently, it might not be invoked if the Flink job does not require
recovery.
** For Celeborn, this makes it unclear when to initialize certain components.
If initialization happens in _*{{registerJob}}*_, it may lack essential
information from the stored snapshot; if it happens in _*{{restoreState}}*_,
there is a risk that it never runs at all.
# *JobID Information Requirement*
** Several methods in _*{{ShuffleMaster}}*_ should include _*JobID*_
information: _*{{ShuffleMaster#supportsBatchSnapshot}}*_,
_*{{ShuffleMaster#snapshotState}}*_, and
_*{{ShuffleMaster#restoreState}}*_.
** These methods are intended for job-granularity state storage and
restoration, but they currently do not incorporate JobID.
** Consequently, Celeborn is unable to determine which job triggered these
calls.
# *Hybrid Shuffle Support for Batch Job Recovery*
** Hybrid shuffle should support batch job recovery; however, as it stands,
Flink's hybrid shuffle does not.
** The Celeborn integration with Flink's hybrid shuffle uses
_*{{Flink#NettyShuffleService}}*_, and because that service does not support
batch job recovery for hybrid shuffle, the integration cannot support it
either.
# *Cluster Granularity Store/Restore State*
** Presently, _*{{ShuffleMaster}}*_ only offers job-granularity interfaces for
storing and restoring state, as the _*{{NettyShuffleService}}*_ is stateless in
terms of cluster granularity.
** However, _*{{Celeborn#ShuffleMaster}}*_ needs to communicate with the
Celeborn Master, so it must store some cluster-level state, such as
_*{{CelebornAppId}}*_.
** In my opinion, the cluster-granularity store-state interface could be
executed after _*{{ShuffleMaster#start}}*_, and _*{{ShuffleMaster#start}}*_
could take an additional snapshot parameter to restore the cluster state.
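
To make points 1, 2, and 4 concrete, here is a minimal, self-contained sketch of what the proposed shape could look like. Everything in it is hypothetical: {{JobAwareShuffleMaster}}, {{ClusterSnapshot}}, {{JobSnapshot}}, and the in-memory implementation are illustrative stand-ins, not Flink's actual {{ShuffleMaster}} interface or Celeborn's implementation, and a plain {{String}} stands in for Flink's JobID.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class ShuffleRecoverySketch {

    /** Hypothetical cluster-level snapshot, e.g. holding the Celeborn application id. */
    static final class ClusterSnapshot {
        final String appId;
        ClusterSnapshot(String appId) { this.appId = appId; }
    }

    /** Hypothetical job-level snapshot; jobId is a String stand-in for Flink's JobID. */
    static final class JobSnapshot {
        final String jobId;
        final int finishedPartitions;
        JobSnapshot(String jobId, int finishedPartitions) {
            this.jobId = jobId;
            this.finishedPartitions = finishedPartitions;
        }
    }

    /** Hypothetical shape of the proposal; NOT Flink's actual ShuffleMaster interface. */
    interface JobAwareShuffleMaster {
        void start(Optional<ClusterSnapshot> clusterSnapshot);            // point 4
        ClusterSnapshot snapshotClusterState();                           // point 4
        void registerJob(String jobId);
        boolean supportsBatchSnapshot(String jobId);                      // point 2
        JobSnapshot snapshotState(String jobId);                          // point 2
        void restoreState(String jobId, Optional<JobSnapshot> snapshot);  // points 1 and 2
    }

    /** Toy in-memory implementation used only to exercise the call order. */
    static final class InMemoryShuffleMaster implements JobAwareShuffleMaster {
        private final String freshAppId;
        private String appId;
        private final Set<String> registered = new HashSet<>();
        private final Map<String, Integer> finishedByJob = new HashMap<>();

        InMemoryShuffleMaster(String freshAppId) { this.freshAppId = freshAppId; }

        @Override public void start(Optional<ClusterSnapshot> clusterSnapshot) {
            // Reuse the stored application id if a cluster snapshot exists.
            appId = clusterSnapshot.map(s -> s.appId).orElse(freshAppId);
        }

        @Override public ClusterSnapshot snapshotClusterState() {
            return new ClusterSnapshot(appId);
        }

        @Override public void registerJob(String jobId) {
            // Only records the job; per-job initialization is deferred to restoreState.
            registered.add(jobId);
        }

        @Override public boolean supportsBatchSnapshot(String jobId) {
            return registered.contains(jobId);
        }

        @Override public JobSnapshot snapshotState(String jobId) {
            return new JobSnapshot(jobId, finishedPartitions(jobId));
        }

        @Override public void restoreState(String jobId, Optional<JobSnapshot> snapshot) {
            if (!registered.contains(jobId)) {
                throw new IllegalStateException("restoreState before registerJob: " + jobId);
            }
            // Always invoked: an empty snapshot means "start from scratch".
            finishedByJob.put(jobId, snapshot.map(s -> s.finishedPartitions).orElse(0));
        }

        void partitionFinished(String jobId) { finishedByJob.merge(jobId, 1, Integer::sum); }

        int finishedPartitions(String jobId) { return finishedByJob.getOrDefault(jobId, 0); }
    }

    /** Runs a fresh start, takes both snapshots, then simulates a recovery. */
    public static String demo() {
        InMemoryShuffleMaster first = new InMemoryShuffleMaster("app-1");
        first.start(Optional.empty());
        first.registerJob("job-a");
        first.restoreState("job-a", Optional.empty()); // called even with nothing to restore
        first.partitionFinished("job-a");
        first.partitionFinished("job-a");
        ClusterSnapshot cluster = first.snapshotClusterState();
        JobSnapshot job = first.snapshotState("job-a");

        InMemoryShuffleMaster second = new InMemoryShuffleMaster("app-2");
        second.start(Optional.of(cluster));
        second.registerJob("job-a");
        second.restoreState("job-a", Optional.of(job));
        return second.appId + ":" + second.finishedPartitions("job-a");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "app-1:2"
    }
}
```

In this sketch {{restoreState}} is the single place where per-job components are initialized, because it always runs after {{registerJob}} and receives an empty snapshot when there is nothing to recover. Whether a real change should use an {{Optional}}, a separate callback, or something else is an open design question.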