Re: Sporadic issues with savepoint status lookup in Flink 1.15

Chesnay Schepler Fri, 17 Jun 2022 00:41:00 -0700

We made several changes to the savepoint REST API backend, where something may have snuck in. The odd thing is that you only see the issue for stop-with-savepoint, which is internally handled the same way as savepoints.


On 16/06/2022 17:57, Peter Westermann wrote:

We run a standalone Flink cluster in session mode (but we usually only run one job per cluster; session mode just fits better with our deployment workflow than application mode).
We trigger hourly savepoints and also use savepoints to stop a job and then restart with a new version of the jar.
I haven’t seen any issue with the hourly savepoints (without stopping the job). For these, I can see messages such as "Evicted result with trigger id 30f9457373eba7b9de1bdeaf591a6956 because its TTL of 300s has expired." ~5 minutes after savepoint completion.
When the stop-with-savepoint status lookup fails with "Exception occurred in REST handler: There is no savepoint operation with triggerId=cee5054245598efb42245b3046a6ae75", I still see "Evicted result with trigger id cee5054245598efb42245b3046a6ae75 because its TTL of 300s has expired." ~5 minutes after savepoint completion.
The documentation <https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/rest_api/#api> for Flink 1.15 mentions a new feature:
/For (stop-with-)savepoint operations you can control this triggerId by setting it in the body of the request that triggers the operation. This allows you to safely retry such operations without triggering multiple savepoints./
Could this have anything to do with the error I am seeing?
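
For reference, a minimal sketch of what a client-chosen triggerId could look like when building the request body for POST /jobs/:jobid/stop. The field names "targetDirectory", "drain" and "triggerId" are taken from my reading of the 1.15 REST docs and should be verified against your version; the helper name build_stop_request and the savepoint path are hypothetical. Nothing is sent over the network here.

```python
import json
import uuid


def build_stop_request(target_directory, trigger_id=None, drain=False):
    """Build a JSON body for POST /jobs/:jobid/stop (field names are assumptions)."""
    if trigger_id is None:
        # The trigger ids in this thread are 32 hex characters; uuid4().hex has the same shape.
        trigger_id = uuid.uuid4().hex
    body = {
        "targetDirectory": target_directory,
        "drain": drain,
        "triggerId": trigger_id,
    }
    # sort_keys makes the payload byte-identical across retries.
    return trigger_id, json.dumps(body, sort_keys=True)


# First attempt: the client picks the id (hypothetical savepoint directory).
trigger_id, payload = build_stop_request("s3://bucket/savepoints")

# Retry: reuse the same id so the request describes the same operation.
retry_id, retry_payload = build_stop_request("s3://bucket/savepoints", trigger_id=trigger_id)
```

Reusing the id on retry is the point of the feature as the documentation describes it: a resent request refers to the same operation instead of starting a second savepoint.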

Peter Westermann

Analytics Software Architect

peter.westerm...@genesys.com <mailto:peter.westerm...@genesys.com>

*From: *Chesnay Schepler <ches...@apache.org>
*Date: *Thursday, June 16, 2022 at 11:32 AM
*To: *Peter Westermann <no.westerm...@genesys.com>, user@flink.apache.org <user@flink.apache.org>
*Subject: *Re: Sporadic issues with savepoint status lookup in Flink 1.15

* EXTERNAL EMAIL - Please use caution with links and attachments *

------------------------------------------------------------------------
ok that shouldn't happen. I couldn't find anything wrong in the code so far; will continue trying to reproduce it.
If this happens, does it persist indefinitely for a particular triggerId, or does it reappear later on again?
Are you only ever triggering a single savepoint for a given job?

Are you using session or application clusters?

On 16/06/2022 16:59, Peter Westermann wrote:

    If it happens it happens immediately. Once we receive the
    triggerId from */jobs/:jobid/stop* or */jobs/:jobid/savepoints* we
    poll */jobs/:jobid/savepoints/:triggerid* every second until the
    status is no longer IN_PROGRESS.
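
The polling loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual client: fetch_status is a hypothetical callable standing in for GET /jobs/:jobid/savepoints/:triggerid that returns the parsed JSON body, and the {"status": {"id": ...}} response shape is assumed from the REST docs. The max_polls cap is an addition so the loop cannot run forever.

```python
import time


def wait_for_savepoint(fetch_status, interval_s=1.0, max_polls=600, sleep=time.sleep):
    """Poll until the reported status id is no longer IN_PROGRESS."""
    for _ in range(max_polls):
        response = fetch_status()
        if response["status"]["id"] != "IN_PROGRESS":
            return response
        sleep(interval_s)
    raise TimeoutError(f"still IN_PROGRESS after {max_polls} polls")


# Stand-in for the REST endpoint: two IN_PROGRESS answers, then completion.
_fake_responses = iter([
    {"status": {"id": "IN_PROGRESS"}},
    {"status": {"id": "IN_PROGRESS"}},
    {"status": {"id": "COMPLETED"}, "operation": {"location": "file:///tmp/sp-1"}},
])

# sleep is replaced with a no-op so the demo finishes instantly.
result = wait_for_savepoint(lambda: next(_fake_responses), sleep=lambda _s: None)
```

A real fetch_status would also need to decide how to treat the 404 reported in this thread, for example by surfacing it immediately rather than retrying.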

    Peter Westermann

    Analytics Software Architect

    peter.westerm...@genesys.com <mailto:peter.westerm...@genesys.com>

    <http://www.genesys.com/>

    *From: *Chesnay Schepler <ches...@apache.org>
    <mailto:ches...@apache.org>
    *Date: *Thursday, June 16, 2022 at 10:55 AM
    *To: *Peter Westermann <no.westerm...@genesys.com>
    <mailto:no.westerm...@genesys.com>, user@flink.apache.org
    <user@flink.apache.org> <mailto:user@flink.apache.org>
    *Subject: *Re: Sporadic issues with savepoint status lookup in
    Flink 1.15

    * EXTERNAL EMAIL - Please use caution with links and attachments *

    ------------------------------------------------------------------------

    There is an expected case where this might happen:

    if too much time has elapsed since the savepoint was completed
    (default 5 minutes; controlled by rest.async.store-duration)
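
To illustrate the expected case, here is a toy model of a result store with a time-to-live. It is not Flink's implementation; the class name ToyResultStore and its methods are made up for this sketch, and the 300 second default only mirrors the 5 minute default mentioned above.

```python
import time


class ToyResultStore:
    """Keeps completed results for ttl_s seconds, then treats them as unknown."""

    def __init__(self, ttl_s=300.0, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._completed = {}

    def complete(self, trigger_id, result):
        # Remember when the operation finished so the TTL can be checked later.
        self._completed[trigger_id] = (self._clock(), result)

    def lookup(self, trigger_id):
        entry = self._completed.get(trigger_id)
        if entry is not None:
            completed_at, result = entry
            if self._clock() - completed_at < self._ttl_s:
                return result
            # TTL expired: evict, then fall through to the not-found error.
            del self._completed[trigger_id]
        raise KeyError(f"no savepoint operation with triggerId={trigger_id}")


# A fake clock makes the demo deterministic.
now = [0.0]
store = ToyResultStore(clock=lambda: now[0])
store.complete("cee5054245598efb42245b3046a6ae75", "COMPLETED")

now[0] = 10.0  # 10 seconds after completion: still available
early = store.lookup("cee5054245598efb42245b3046a6ae75")

now[0] = 301.0  # just past the 300 second TTL: looks like an unknown trigger id
try:
    store.lookup("cee5054245598efb42245b3046a6ae75")
    late = "found"
except KeyError:
    late = "not found"
```

In this model a lookup that fails well inside the TTL window cannot be explained by eviction, which is why the timing question matters.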

    Did this happen earlier than that?

    On 16/06/2022 15:53, Peter Westermann wrote:

        We recently upgraded one of our Flink clusters to version
        1.15.0 and are now seeing sporadic issues when stopping a job
        with a savepoint via the REST API. This happens for
        */jobs/:jobid/savepoints* and */jobs/:jobid/stop*:

        The job finishes with a savepoint but the triggerId returned
        from the REST API seems to be invalid. Any lookups via
        */jobs/:jobid/savepoints/:triggerid* fail with a 404 and the
        following error:

        org.apache.flink.runtime.rest.handler.RestHandlerException:
        There is no savepoint operation with
        triggerId=cee5054245598efb42245b3046a6ae75 for job
        0995a9461f0178294ea71c9accbe750c

        Peter Westermann

        Analytics Software Architect

        peter.westerm...@genesys.com <mailto:peter.westerm...@genesys.com>
