[jira] [Updated] (FLINK-36873) Adapting batch job progress recovery to Apache Celeborn

ASF GitHub Bot (Jira) Tue, 17 Dec 2024 04:35:34 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated FLINK-36873:
-----------------------------------
    Labels: pull-request-available  (was: )

> Adapting batch job progress recovery to Apache Celeborn
> -------------------------------------------------------
>
>                 Key: FLINK-36873
>                 URL: https://issues.apache.org/jira/browse/FLINK-36873
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: xuhuang
>            Assignee: Junrui Lee
>            Priority: Major
>              Labels: pull-request-available
>
> I've identified several issues while attempting to enable Apache Celeborn to 
> support Flink batch job recovery.
> *1. RestoreState Invocation*
>  * The method _*{{ShuffleMaster#restoreState}}*_ should be triggered 
> regardless of whether the Flink job requires recovery.
>  * This method signifies that a Flink job needs to restore its state, but it 
> is currently called only after {_}*{{ShuffleMaster#registerJob}}*{_}.
>  * Consequently, it might not be invoked if the Flink job does not require 
> recovery.
>  * For Celeborn, this creates uncertainty regarding when to initialize 
> certain components; if the initialization occurs during 
> {*}_{{registerJob}}_{*}, it may lack essential information from the stored 
> snapshot, whereas if it takes place during {*}_{{restoreState}}_{*}, there is 
> a risk that it may not be invoked at all.
> *2. JobID Information Requirement* 
>  * Several methods in _*{{ShuffleMaster}}*_ should include _*JobID*_ 
> information: {*}_{{ShuffleMaster#supportsBatchSnapshot}}_{*}, 
> {_}*{{ShuffleMaster#snapshotState}}*{_}, and 
> {_}*{{ShuffleMaster#restoreState}}*{_}.
>  * These methods are intended for job-granularity state storage and 
> restoration, but they currently do not incorporate JobID.
>  * Consequently, Celeborn is unable to determine which job triggered these 
> calls.
> *3. Cluster-Granularity State Store/Restore*
>  * Presently, _*{{ShuffleMaster}}*_ only offers job-granularity interfaces 
> for storing and restoring state, as the _*{{NettyShuffleService}}*_ is 
> stateless in terms of cluster granularity.
>  * However, _*{{Celeborn#ShuffleMaster}}*_ needs to communicate with the 
> Celeborn Master, necessitating the storage of certain cluster-level states, 
> such as {_}*{{CelebornAppId}}*{_}.
>  * In my opinion, the cluster-granularity state-store interface could be 
> invoked after {_}*{{ShuffleMaster#start}}*{_}, and 
> _*{{ShuffleMaster#start}}*_ could take an additional snapshot parameter to 
> restore the cluster state.
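
As a rough illustration of the three proposals above, here is a minimal, self-contained Java sketch. It does not use Flink or Celeborn classes. Every name in it (JobAwareShuffleMaster, InMemoryShuffleMaster, snapshotClusterState, the String job ids and snapshots) is hypothetical and only stands in for the real types. It assumes restoreState is always called after registerJob (with an empty snapshot when there is nothing to recover), that the job-granularity methods receive a job id, and that start accepts an optional cluster-level snapshot.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Illustrative only: none of these types are Flink's or Celeborn's real API. */
class ShuffleMasterSketch {

    /** Hypothetical interface shape reflecting the three proposals. */
    interface JobAwareShuffleMaster {
        // Proposal 3: start() receives the cluster-level snapshot, if one exists.
        void start(Optional<String> clusterSnapshot);

        // Proposal 3: cluster-level state can be snapshotted after start().
        String snapshotClusterState();

        void registerJob(String jobId);

        // Proposal 2: job-granularity methods carry the job id.
        boolean supportsBatchSnapshot(String jobId);

        List<String> snapshotState(String jobId);

        // Proposal 1: always called after registerJob; empty means nothing to recover.
        void restoreState(String jobId, Optional<List<String>> jobSnapshot);
    }

    /** Toy implementation that keeps all state in memory. */
    static class InMemoryShuffleMaster implements JobAwareShuffleMaster {
        private String appId;
        private final Map<String, List<String>> partitionsByJob = new HashMap<>();
        private final List<String> initializedJobs = new ArrayList<>();

        @Override
        public void start(Optional<String> clusterSnapshot) {
            // Reuse the previous application id on recovery, otherwise create one.
            appId = clusterSnapshot.orElse("app-fresh");
        }

        @Override
        public String snapshotClusterState() {
            return appId;
        }

        @Override
        public void registerJob(String jobId) {
            partitionsByJob.put(jobId, new ArrayList<>());
        }

        @Override
        public boolean supportsBatchSnapshot(String jobId) {
            return partitionsByJob.containsKey(jobId);
        }

        @Override
        public List<String> snapshotState(String jobId) {
            return new ArrayList<>(partitionsByJob.get(jobId));
        }

        @Override
        public void restoreState(String jobId, Optional<List<String>> jobSnapshot) {
            // Per-job initialization happens here in both the fresh and recovery cases.
            jobSnapshot.ifPresent(s -> partitionsByJob.get(jobId).addAll(s));
            initializedJobs.add(jobId);
        }

        void addPartition(String jobId, String partition) {
            partitionsByJob.get(jobId).add(partition);
        }

        String appId() {
            return appId;
        }

        List<String> initializedJobs() {
            return initializedJobs;
        }
    }

    public static void main(String[] args) {
        // First run: no snapshots exist, but restoreState is still invoked.
        InMemoryShuffleMaster first = new InMemoryShuffleMaster();
        first.start(Optional.empty());
        first.registerJob("job-a");
        first.restoreState("job-a", Optional.empty());
        first.addPartition("job-a", "partition-0");
        String clusterState = first.snapshotClusterState();
        List<String> jobState = first.snapshotState("job-a");

        // Recovery: the cluster snapshot goes to start(), the job snapshot to restoreState().
        InMemoryShuffleMaster recovered = new InMemoryShuffleMaster();
        recovered.start(Optional.of(clusterState));
        recovered.registerJob("job-a");
        recovered.restoreState("job-a", Optional.of(jobState));
        System.out.println(recovered.appId() + " " + recovered.snapshotState("job-a"));
    }
}
```

With this shape, an implementation has a single, guaranteed place (restoreState) to finish per-job initialization, and it can tell which job each call belongs to. Whether the actual change adopts these signatures is up to the linked pull request.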



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
