Re: Duplicate copies of job in Flink UI/API

Chesnay Schepler Thu, 09 Sep 2021 07:06:20 -0700

I'm afraid there's no real workaround.

If the information for completed jobs isn't important to you, then setting jobstore.expiration-time to a low value can reduce the impact, or setting jobstore.max-capacity to 0 would prevent any completed job from being displayed.
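
As a rough sketch of what those two settings could look like in flink-conf.yaml: the value 60 is only an illustrative assumption, not a recommendation from this thread, and jobstore.expiration-time is given in seconds.

```yaml
# flink-conf.yaml (pick one of the two options)

# Option 1: purge completed-job entries from the job store after 60 seconds.
# The value 60 is an illustrative assumption; choose what suits your setup.
jobstore.expiration-time: 60

# Option 2: keep no completed jobs at all, so none are displayed.
# jobstore.max-capacity: 0
```

Either option trades away visibility of completed jobs in the REST API and UI, so it only fits if that information isn't needed.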


Beyond that I can't think of anything.

We will track the issue in this ticket: https://issues.apache.org/jira/browse/FLINK-20195

I will also file a related ticket for the job archiving failure, which happens because, again, we archive multiple jobs with the same job ID.



On 09/09/2021 15:15, Peter Westermann wrote:

Thanks Chesnay. You are understanding this correctly. Your explanation makes sense to me.

Is there anything we can do to prevent this? At least for us, most times a leader election happens, the leader doesn’t actually change because the jobmanager is still healthy.


Thanks,

Peter


*From: *Chesnay Schepler <ches...@apache.org>
*Date: *Thursday, September 9, 2021 at 9:11 AM

*To: *Peter Westermann <no.westerm...@genesys.com>, Piotr Nowojski <pnowoj...@apache.org>, user@flink.apache.org <user@flink.apache.org>

*Subject: *Re: Duplicate copies of job in Flink UI/API

Just to double-check that I'm understanding things correctly:

You have a job with HA, then Zookeeper breaks down, the job gets suspended, ZK comes back online, and the _same_ JobManager becomes the leader?


If so, then I can explain why this happens and hopefully reproduce it.

In short, when a job is suspended (or terminates in any other way), information about the job is stored in a data-structure.


This is used by the REST API (and thus, UI) to query completed jobs.

For _running_ jobs we query the JobMaster (a component within the JobManager responsible for that job) instead.

When listing all jobs we query all jobs from the data-structure for finished jobs, _and_ all JobMasters for running jobs. The core assumption here is that for a given ID only one of these can return something.

So what ends up happening is that when the job is suspended it is written to the data-structure, and then another JobMaster is started for the same job, and when listing all jobs we can now end up asking for the same job from multiple sources.

This is a somewhat unusual scenario because usually when a job is suspended another JobManager becomes the leader (where this wouldn't occur because the data-structure isn't shared across JobManagers).
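
The mechanism described above can be illustrated with a small toy model. This is only a sketch of the explanation, not Flink's actual code: the class `ToyJobManager` and its methods are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJobManager:
    """Hypothetical stand-in for one JobManager process (not a Flink class)."""
    # Store for terminated jobs, local to this process: job ID -> state.
    completed_store: dict = field(default_factory=dict)
    # Stand-in for the JobMasters of running jobs: job ID -> state.
    job_masters: dict = field(default_factory=dict)

    def start_job(self, job_id):
        self.job_masters[job_id] = "RUNNING"

    def suspend_job(self, job_id):
        # The JobMaster stops and the job is recorded as terminated.
        self.job_masters.pop(job_id, None)
        self.completed_store[job_id] = "SUSPENDED"

    def list_jobs(self):
        # Both sources are queried and concatenated; nothing checks that a
        # given job ID appears in only one of them.
        running = list(self.job_masters.items())
        finished = list(self.completed_store.items())
        return running + finished


JOB_ID = "2db4ee6397151a1109d1ca05188a4cbb"

# Case 1: the same JobManager regains leadership after the suspension.
same_jm = ToyJobManager()
same_jm.start_job(JOB_ID)
same_jm.suspend_job(JOB_ID)   # e.g. the ZooKeeper connection is lost
same_jm.start_job(JOB_ID)     # leadership is re-granted to the same process
same_leader_listing = same_jm.list_jobs()

# Case 2: a different JobManager takes over; its store starts out empty.
new_jm = ToyJobManager()
new_jm.start_job(JOB_ID)
new_leader_listing = new_jm.list_jobs()
```

In the first case the listing contains the job twice (once RUNNING, once SUSPENDED); in the second it appears once, because the store is not shared across processes.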


On 09/09/2021 13:37, Peter Westermann wrote:

    Hi Piotr,

    Jobmanager logs are attached to this email. The only thing that
    jumps out to me is this:

    09/08/2021 09:02:26.240 -0400 ERROR
    org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.

      java.io.IOException: File already
    exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb

    This happened days after the Flink update – and not just once.
    Across all our Flink clusters I’ve seen this 3 times. The cause
    for the jobmanager leadership loss in this case was a deployment
    of our zookeeper cluster that led to a brief connection loss. The
    new leader election is expected.

    Thanks,

    Peter

    *From: *Piotr Nowojski <pnowoj...@apache.org>
    *Date: *Thursday, September 9, 2021 at 12:39 AM
    *To: *Peter Westermann <no.westerm...@genesys.com>
    *Cc: *user@flink.apache.org <user@flink.apache.org>
    *Subject: *Re: Duplicate copies of job in Flink UI/API

    Hi Peter,

    Can you provide relevant JobManager logs? And can you write down
    what steps you took before the failure happened? Did this
    failure occur during the Flink upgrade, or after it?

    Best,

    Piotrek

    On Wed, 8 Sep 2021 at 16:11, Peter Westermann
    <no.westerm...@genesys.com> wrote:

        We recently upgraded from Flink 1.12.4 to 1.12.5 and are
        seeing some weird behavior after a change in jobmanager
        leadership: We’re seeing two copies of the same job, one of
        those is in SUSPENDED state and has a start time of zero.
        Here’s the output from the /jobs/overview endpoint:

        {
          "jobs": [{
            "jid": "2db4ee6397151a1109d1ca05188a4cbb",
            "name": "analytics-flink-v1",
            "state": "RUNNING",
            "start-time": 1631106146284,
            "end-time": -1,
            "duration": 2954642,
            "last-modification": 1631106152322,
            "tasks": {
              "total": 112,
              "created": 0,
              "scheduled": 0,
              "deploying": 0,
              "running": 112,
              "finished": 0,
              "canceling": 0,
              "canceled": 0,
              "failed": 0,
              "reconciling": 0
            }
          }, {
            "jid": "2db4ee6397151a1109d1ca05188a4cbb",
            "name": "analytics-flink-v1",
            "state": "SUSPENDED",
            "start-time": 0,
            "end-time": -1,
            "duration": 1631105900760,
            "last-modification": 0,
            "tasks": {
              "total": 0,
              "created": 0,
              "scheduled": 0,
              "deploying": 0,
              "running": 0,
              "finished": 0,
              "canceling": 0,
              "canceled": 0,
              "failed": 0,
              "reconciling": 0
            }
          }]
        }
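
For anyone wanting to check their own clusters for this symptom, a minimal sketch of a duplicate check over a /jobs/overview response might look like the following. The function name `find_duplicate_jobs` is a hypothetical helper, and the sample is a trimmed copy of the response above that keeps only the fields the check needs.

```python
from collections import defaultdict


def find_duplicate_jobs(overview):
    """Return {jid: [state, ...]} for every job ID listed more than once."""
    states_by_id = defaultdict(list)
    for job in overview.get("jobs", []):
        states_by_id[job["jid"]].append(job["state"])
    return {jid: states for jid, states in states_by_id.items() if len(states) > 1}


# Trimmed copy of the /jobs/overview output shown in this message.
sample = {
    "jobs": [
        {"jid": "2db4ee6397151a1109d1ca05188a4cbb", "state": "RUNNING"},
        {"jid": "2db4ee6397151a1109d1ca05188a4cbb", "state": "SUSPENDED"},
    ]
}

duplicates = find_duplicate_jobs(sample)
```

In practice the dictionary would come from parsing the JSON body returned by the endpoint (for example with `json.loads`) rather than from a literal.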

        Has anyone seen this behavior before?

        Thanks,

        Peter
