We run a standalone Flink cluster in session mode (but we usually only run one job per cluster; session mode just fits better with our deployment workflow than application mode). We trigger hourly savepoints and also use savepoints to stop a job and then restart it with a new version of the jar. I haven’t seen any issues with the hourly savepoints (taken without stopping the job). For these, I see log messages such as "Evicted result with trigger id 30f9457373eba7b9de1bdeaf591a6956 because its TTL of 300s has expired." about 5 minutes after savepoint completion.

When the stop-with-savepoint status lookup fails with "Exception occurred in REST handler: There is no savepoint operation with triggerId=cee5054245598efb42245b3046a6ae75", I still see "Evicted result with trigger id cee5054245598efb42245b3046a6ae75 because its TTL of 300s has expired." about 5 minutes after savepoint completion.

The documentation <https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/rest_api/#api> for Flink 1.15 mentions a new feature: "For (stop-with-)savepoint operations you can control this triggerId by setting it in the body of the request that triggers the operation. This allow you to safely* retry such operations without triggering multiple savepoints." Could this have anything to do with the error I am seeing?

Peter Westermann
Analytics Software Architect
peter.westerm...@genesys.com
http://www.genesys.com/

From: Chesnay Schepler <ches...@apache.org>
Date: Thursday, June 16, 2022 at 11:32 AM
To: Peter Westermann <no.westerm...@genesys.com>, user@flink.apache.org
Subject: Re: Sporadic issues with savepoint status lookup in Flink 1.15

OK, that shouldn't happen. I couldn't find anything wrong in the code so far; will continue trying to reproduce it.

If this happens, does it persist indefinitely for a particular triggerId, or does it reappear later on again?
Are you only ever triggering a single savepoint for a given job?
Are you using session or application clusters?

On 16/06/2022 16:59, Peter Westermann wrote:

If it happens, it happens immediately. Once we receive the triggerId from /jobs/:jobid/stop or /jobs/:jobid/savepoints we poll /jobs/:jobid/savepoints/:triggerid every second until the status is no longer IN_PROGRESS.
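
The polling described in the message above could look roughly like the following. This is a minimal sketch, not the poster's actual client: the helper names (http_get_json, wait_for_savepoint), the timeout, and the base URL in the usage note are assumptions made for illustration, and the response shape ({"status": {"id": ...}}) should be checked against the REST API reference for your Flink version.

```python
import json
import time
import urllib.request


def http_get_json(base_url, path):
    """GET base_url + path and decode the JSON body.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so a 404
    like the one reported in this thread would surface as an exception here.
    """
    with urllib.request.urlopen(base_url + path) as resp:
        return json.load(resp)


def wait_for_savepoint(get_json, job_id, trigger_id,
                       interval_s=1.0, timeout_s=600.0, sleep=time.sleep):
    """Poll the savepoint status endpoint until it leaves IN_PROGRESS.

    get_json is any callable mapping a REST path to a decoded JSON dict
    (hypothetical helper), which keeps the loop testable without a cluster.
    """
    path = f"/jobs/{job_id}/savepoints/{trigger_id}"
    waited = 0.0
    while True:
        result = get_json(path)
        # Assumed response shape: {"status": {"id": "IN_PROGRESS" | "COMPLETED"}, ...}
        if result["status"]["id"] != "IN_PROGRESS":
            return result
        if waited >= timeout_s:
            raise TimeoutError(f"savepoint {trigger_id} still in progress after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

A call might look like wait_for_savepoint(lambda p: http_get_json("http://localhost:8081", p), job_id, trigger_id), where the host and port are placeholders. Passing the fetch function in as a parameter is only a design choice that makes the loop easy to exercise against canned responses.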

From: Chesnay Schepler <ches...@apache.org>
Date: Thursday, June 16, 2022 at 10:55 AM
To: Peter Westermann <no.westerm...@genesys.com>, user@flink.apache.org
Subject: Re: Sporadic issues with savepoint status lookup in Flink 1.15

There is an expected case where this might happen: if too much time has elapsed since the savepoint was completed (default 5 minutes; controlled by rest.async.store-duration).
Did this happen earlier than that?

On 16/06/2022 15:53, Peter Westermann wrote:

We recently upgraded one of our Flink clusters to version 1.15.0 and are now seeing sporadic issues when stopping a job with a savepoint via the REST API. This happens for /jobs/:jobid/savepoints and /jobs/:jobid/stop: the job finishes with a savepoint, but the triggerId returned from the REST API seems to be invalid. Any lookups via /jobs/:jobid/savepoints/:triggerid fail with a 404 and the following error:

org.apache.flink.runtime.rest.handler.RestHandlerException: There is no savepoint operation with triggerId=cee5054245598efb42245b3046a6ae75 for job 0995a9461f0178294ea71c9accbe750c
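
For readers who want to try the client-chosen triggerId mentioned in the Flink 1.15 documentation quoted earlier in the thread, a stop-with-savepoint request could be built roughly as follows. This is a sketch under stated assumptions: the JSON field names (targetDirectory, drain, triggerId) are written from memory of the 1.15 REST API and should be verified against the reference, the helper names are hypothetical, and nothing in this thread establishes whether setting the triggerId affects the 404 described above.

```python
import json
import uuid
import urllib.request


def new_trigger_id():
    """Return 32 lowercase hex characters, the same shape as the IDs in this thread."""
    return uuid.uuid4().hex


def build_stop_request(base_url, job_id, target_directory, trigger_id, drain=False):
    """Build (but do not send) a POST to /jobs/:jobid/stop with an explicit triggerId."""
    body = {
        "targetDirectory": target_directory,  # assumed field name; verify in the REST docs
        "drain": drain,                       # assumed field name; verify in the REST docs
        "triggerId": trigger_id,              # client-chosen ID, per the 1.15 docs wording
    }
    return urllib.request.Request(
        url=f"{base_url}/jobs/{job_id}/stop",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with urllib.request.urlopen(request) and reusing the same trigger_id on a retry is the usage the documentation describes; the base URL and savepoint directory are whatever your deployment uses.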