We run a standalone Flink cluster in session mode (but we usually only run one 
job per cluster; session mode just fits better with our deployment workflow 
than application mode).
We trigger hourly savepoints and also use savepoints to stop a job and then 
restart with a new version of the jar.
I haven’t seen any issues with the hourly savepoints (taken without stopping 
the job). For these, about 5 minutes after savepoint completion, I see log 
messages such as:
Evicted result with trigger id 30f9457373eba7b9de1bdeaf591a6956 because its 
TTL of 300s has expired.

When the stop-with-savepoint status lookup fails with
Exception occurred in REST handler: There is no savepoint operation with 
triggerId=cee5054245598efb42245b3046a6ae75
I still see the eviction message about 5 minutes after savepoint completion:
Evicted result with trigger id cee5054245598efb42245b3046a6ae75 because its 
TTL of 300s has expired.

The documentation for Flink 1.15 
<https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/rest_api/#api> 
mentions a new feature:
"For (stop-with-)savepoint operations you can control this triggerId by setting 
it in the body of the request that triggers the operation. This allows you to 
safely* retry such operations without triggering multiple savepoints."

Could this have anything to do with the error I am seeing?
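
For illustration, here is a minimal Python sketch of what a client-chosen triggerId could look like in a stop-with-savepoint request body. The field names (targetDirectory, drain, triggerId) are my reading of the 1.15 REST docs for /jobs/:jobid/stop and should be checked against them; the target directory is a made-up placeholder, and the helper names are hypothetical.

```python
import json
import uuid


def new_trigger_id() -> str:
    """Return a 32-character hex string to use as a client-chosen triggerId."""
    return uuid.uuid4().hex


def stop_with_savepoint_body(target_directory: str, trigger_id: str,
                             drain: bool = False) -> str:
    """Build the JSON body for POST /jobs/:jobid/stop.

    Assumption: the field names below match the Flink 1.15 REST API docs.
    """
    return json.dumps({
        "targetDirectory": target_directory,
        "drain": drain,
        "triggerId": trigger_id,
    })


# Generate the id once and keep it; resending the same body on a retry
# reuses the same triggerId instead of asking for a second savepoint.
trigger_id = new_trigger_id()
body = stop_with_savepoint_body("s3://example-bucket/savepoints", trigger_id)
```

The point of generating the id on the client is that a retried POST carries the same triggerId, so the status lookup does not depend on having received the first response.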



Peter Westermann
Analytics Software Architect
peter.westerm...@genesys.com<mailto:peter.westerm...@genesys.com>


From: Chesnay Schepler <ches...@apache.org>
Date: Thursday, June 16, 2022 at 11:32 AM
To: Peter Westermann <no.westerm...@genesys.com>, user@flink.apache.org 
<user@flink.apache.org>
Subject: Re: Sporadic issues with savepoint status lookup in Flink 1.15

________________________________
OK, that shouldn't happen. I haven't found anything wrong in the code so far; 
I will keep trying to reproduce it.

If this happens, does it persist indefinitely for a particular triggerId, or 
does it reappear later on again?
Are you only ever triggering a single savepoint for a given job?

Are you using session or application clusters?

On 16/06/2022 16:59, Peter Westermann wrote:
If it happens it happens immediately. Once we receive the triggerId from 
/jobs/:jobid/stop or /jobs/:jobid/savepoints we poll 
/jobs/:jobid/savepoints/:triggerid every second until the status is no longer 
IN_PROGRESS.
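
A rough Python sketch of the polling loop described above, for anyone trying to reproduce this. It assumes the status response carries the state under status.id (as in the REST docs); the function names, the poll limit, and the base URL in the usage comment are hypothetical. Note that urllib raises HTTPError on the 404 discussed in this thread.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status(base_url: str, job_id: str, trigger_id: str) -> dict:
    """GET /jobs/:jobid/savepoints/:triggerid and decode the JSON body."""
    url = f"{base_url}/jobs/{job_id}/savepoints/{trigger_id}"
    with urllib.request.urlopen(url) as response:  # raises HTTPError on 404
        return json.load(response)


def wait_for_savepoint(fetch: Callable[[], dict],
                       interval_s: float = 1.0,
                       max_polls: int = 600,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch() until the status is no longer IN_PROGRESS."""
    for _ in range(max_polls):
        result = fetch()
        # Assumption: the response looks like {"status": {"id": "..."}, ...}.
        if result["status"]["id"] != "IN_PROGRESS":
            return result
        sleep(interval_s)
    raise TimeoutError(f"savepoint still IN_PROGRESS after {max_polls} polls")


# Usage (hypothetical host; ids taken from the error message in this thread):
# result = wait_for_savepoint(lambda: fetch_status(
#     "http://localhost:8081",
#     "0995a9461f0178294ea71c9accbe750c",
#     "cee5054245598efb42245b3046a6ae75"))
```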



From: Chesnay Schepler <ches...@apache.org>
Date: Thursday, June 16, 2022 at 10:55 AM
To: Peter Westermann <no.westerm...@genesys.com>, user@flink.apache.org 
<user@flink.apache.org>
Subject: Re: Sporadic issues with savepoint status lookup in Flink 1.15

________________________________
There is one expected case where this can happen: too much time has elapsed 
since the savepoint completed (5 minutes by default; controlled by 
rest.async.store-duration).

Did this happen earlier than that?
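
To make the question concrete, here is a tiny Python check a client could log alongside a failed lookup. It only compares timestamps against the 5-minute default mentioned above; the function name is hypothetical, and the exact moment of eviction inside Flink is an assumption here.

```python
def lookup_outside_store_duration(completed_at_s: float,
                                  lookup_at_s: float,
                                  store_duration_s: float = 300.0) -> bool:
    """True if the lookup came later than store_duration_s after completion.

    Assumption: results are evicted once the configured duration
    (rest.async.store-duration, 5 minutes by default) has passed.
    """
    return (lookup_at_s - completed_at_s) > store_duration_s


# A lookup 1 second after completion is well inside the window, so a
# "no savepoint operation" error at that point is not the expected case.
unexpected = not lookup_outside_store_duration(completed_at_s=0.0, lookup_at_s=1.0)
```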

On 16/06/2022 15:53, Peter Westermann wrote:
We recently upgraded one of our Flink clusters to version 1.15.0 and are now 
seeing sporadic issues when stopping a job with a savepoint via the REST API. 
This happens for /jobs/:jobid/savepoints and /jobs/:jobid/stop:
The job finishes with a savepoint but the triggerId returned from the REST API 
seems to be invalid. Any lookups via /jobs/:jobid/savepoints/:triggerid fail 
with a 404 and the following error:

org.apache.flink.runtime.rest.handler.RestHandlerException: There is no 
savepoint operation with triggerId=cee5054245598efb42245b3046a6ae75 for job 
0995a9461f0178294ea71c9accbe750c






