[ 
https://issues.apache.org/jira/browse/FLINK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liting liu updated FLINK-37013:
-------------------------------
    Description: 
I found that the JobManager failed to restart the job on time. Here are some 
of my logs:
{code:java}
// code placeholder
2024-12-30 18:50:32,089 INFO  org.apache.flink.runtime.jobmaster.JobMaster [] - Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=15, backoffTimeMS=30000) for XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb).
2025-01-06 03:08:25,556 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb) switched from state RUNNING to RESTARTING.
2025-01-06 03:08:26,746 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job 77ae983fe1ccdb4b6a8dffdd3754fcdb: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=1}]
2025-01-06 03:10:02,193 INFO  org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [530442f543a64060c85150a407a1b3e7].
2025-01-06 03:10:02,198 INFO  org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [e43037827cef57edecf298e1378075dd].

 {code}
The restart strategy is maxNumberRestartAttempts=15, backoffTimeMS=30000. The 
job switched to RESTARTING at 2025-01-06 03:08:25,556, but it did not restart 
30 s later as expected. 
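
To make the timing concrete, here is a minimal Java sketch that redoes the arithmetic from the log excerpt above. It assumes the next logged event (the first "Releasing slot" line at 03:10:02,193) marks the earliest visible activity after the transition to RESTARTING; the excerpt does not show when, or whether, the job actually restarted. The class name `RestartDelayCheck` and its helper methods are hypothetical and are not part of Flink.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class RestartDelayCheck {
    // Timestamp layout used by the log lines in this report.
    static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Time between two log timestamps.
    static Duration elapsed(String from, String to) {
        return Duration.between(
                LocalDateTime.parse(from, LOG_TS),
                LocalDateTime.parse(to, LOG_TS));
    }

    // Timestamp at which a fixed-delay strategy would be expected to restart.
    static String expectedRestart(String restartingAt, Duration backoff) {
        return LocalDateTime.parse(restartingAt, LOG_TS).plus(backoff).format(LOG_TS);
    }

    public static void main(String[] args) {
        Duration backoff = Duration.ofMillis(30000);   // backoffTimeMS from the log
        String restartingAt = "2025-01-06 03:08:25,556"; // RUNNING -> RESTARTING
        String nextEvent = "2025-01-06 03:10:02,193";    // first "Releasing slot" line

        Duration gap = elapsed(restartingAt, nextEvent);
        System.out.println("Expected restart at: " + expectedRestart(restartingAt, backoff));
        System.out.println("Gap to next logged event: " + gap.toMillis() + " ms");
        System.out.println("Beyond configured backoff: " + gap.minus(backoff).toMillis() + " ms");
    }
}
```

Under these assumptions the expected restart time is 2025-01-06 03:08:55,556, while the next logged event is 96637 ms after the transition, about 66.6 s past the configured backoff.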

  was:
I found that the JobManager failed to restart the job on time. Here are some 
of my logs:
{code:java}
// code placeholder
2024-12-30 18:50:32,089 INFO  org.apache.flink.runtime.jobmaster.JobMaster [] - Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=15, backoffTimeMS=30000) for XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb).
2025-01-06 03:08:25,556 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb) switched from state RUNNING to RESTARTING.
2025-01-06 03:08:26,746 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job 77ae983fe1ccdb4b6a8dffdd3754fcdb: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=1}]
2025-01-06 03:10:02,193 INFO  org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [530442f543a64060c85150a407a1b3e7].
2025-01-06 03:10:02,198 INFO  org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [e43037827cef57edecf298e1378075dd].

 {code}
 


> Job failed to restart on time
> -----------------------------
>
>                 Key: FLINK-37013
>                 URL: https://issues.apache.org/jira/browse/FLINK-37013
>             Project: Flink
>          Issue Type: Bug
>          Components: API / DataStream
>    Affects Versions: 1.15.4
>            Reporter: liting liu
>            Priority: Major
>
> I found that the JobManager failed to restart the job on time. Here are some 
> of my logs:
> {code:java}
> // code placeholder
> 2024-12-30 18:50:32,089 INFO  org.apache.flink.runtime.jobmaster.JobMaster [] - Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=15, backoffTimeMS=30000) for XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb).
> 2025-01-06 03:08:25,556 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb) switched from state RUNNING to RESTARTING.
> 2025-01-06 03:08:26,746 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job 77ae983fe1ccdb4b6a8dffdd3754fcdb: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=1}]
> 2025-01-06 03:10:02,193 INFO  org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [530442f543a64060c85150a407a1b3e7].
> 2025-01-06 03:10:02,198 INFO  org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [e43037827cef57edecf298e1378075dd].
>  {code}
> The restart strategy is maxNumberRestartAttempts=15, backoffTimeMS=30000. The 
> job switched to RESTARTING at 2025-01-06 03:08:25,556, but it did not restart 
> 30 s later as expected. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
