[ https://issues.apache.org/jira/browse/FLINK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liting liu updated FLINK-37013:
-------------------------------
    Description: 
I found that the JobManager failed to restart the job on time. Here are some of my logs:
{code:java}
// code placeholder
2024-12-30 18:50:32,089 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=15, backoffTimeMS=30000) for XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb).
2025-01-06 03:08:25,556 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb) switched from state RUNNING to RESTARTING.
2025-01-06 03:08:26,746 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job 77ae983fe1ccdb4b6a8dffdd3754fcdb: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=1}]
2025-01-06 03:10:02,193 INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [530442f543a64060c85150a407a1b3e7].
2025-01-06 03:10:02,198 INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [e43037827cef57edecf298e1378075dd].
{code}
The restart strategy is maxNumberRestartAttempts=15, backoffTimeMS=30000. The job switched to RESTARTING at 2025-01-06 03:08:25,556, but did not restart 30s later as expected.

  was:
I found the jobmanager failed to restart the job on time, here are some of my log:
{code:java}
// code placeholder
2024-12-30 18:50:32,089 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=15, backoffTimeMS=30000) for XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb).
2025-01-06 03:08:25,556 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb) switched from state RUNNING to RESTARTING.
2025-01-06 03:08:26,746 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job 77ae983fe1ccdb4b6a8dffdd3754fcdb: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=1}]
2025-01-06 03:10:02,193 INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [530442f543a64060c85150a407a1b3e7].
2025-01-06 03:10:02,198 INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [e43037827cef57edecf298e1378075dd].
{code}

> Job failed to restart on time
> -----------------------------
>
> Key: FLINK-37013
> URL: https://issues.apache.org/jira/browse/FLINK-37013
> Project: Flink
> Issue Type: Bug
> Components: API / DataStream
> Affects Versions: 1.15.4
> Reporter: liting liu
> Priority: Major
>
> I found that the JobManager failed to restart the job on time. Here are some
> of my logs:
> {code:java}
> // code placeholder
> 2024-12-30 18:50:32,089 INFO org.apache.flink.runtime.jobmaster.JobMaster
> [] - Using restart back off time strategy
> FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=15,
> backoffTimeMS=30000) for XXX (77ae983fe1ccdb4b6a8dffdd3754fcdb).
> 2025-01-06 03:08:25,556 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - XXX
> (77ae983fe1ccdb4b6a8dffdd3754fcdb) switched from state RUNNING to RESTARTING.
> 2025-01-06 03:08:26,746 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> 77ae983fe1ccdb4b6a8dffdd3754fcdb:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=1}]
> 2025-01-06 03:10:02,193 INFO
> org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] -
> Releasing slot [530442f543a64060c85150a407a1b3e7].
> 2025-01-06 03:10:02,198 INFO
> org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] -
> Releasing slot [e43037827cef57edecf298e1378075dd].
> {code}
> The restart strategy is maxNumberRestartAttempts=15, backoffTimeMS=30000. The
> job switched to RESTARTING at 2025-01-06 03:08:25,556, but did not restart
> 30s later as expected.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)