Just to double-check that I'm understanding things correctly:
You have a job with HA, then Zookeeper breaks down, the job gets
suspended, ZK comes back online, and the _same_ JobManager becomes the
leader?
If so, then I can explain why this happens and hopefully reproduce it.
In short, when a job is suspended (or terminates in any other way) then
information about the job is stored in a data-structure.
This is used by the REST API (and thus, UI) to query completed jobs.
For _running_ jobs we query the JobMaster (a component within the
JobManager responsible for that job) instead.
When listing all jobs we query all jobs from the data-structure for
finished jobs, _and_ all JobMasters for running jobs. The core
assumption here is that for a given ID only one of these can return
something.
So what ends up happening is that when the job is suspended it is
written to the data-structure, and then another JobMaster is started for
the same job, and when listing all jobs we can now end up asking for the
same job from multiple sources.
This is a somewhat unusual scenario because usually when a job is
suspended another JobManager becomes the leader (where this wouldn't
occur because the data-structure isn't shared across JobManagers).
On 09/09/2021 13:37, Peter Westermann wrote:
Hi Piotr,
Jobmanager logs are attached to this email. The only thing that jumps
out to me is this:
09/08/2021 09:02:26.240 -0400 ERROR
org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.
java.io.IOException: File already
exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb
This happened days after the Flink update – and not just once. Across
all our Flink clusters I’ve seen this 3 times. The cause for the
jobmanager leadership loss in this case was a deployment of our
zookeeper cluster that lead to a brief connection loss. The new leader
election is expected.
Thanks,
Peter
*From: *Piotr Nowojski <pnowoj...@apache.org>
*Date: *Thursday, September 9, 2021 at 12:39 AM
*To: *Peter Westermann <no.westerm...@genesys.com>
*Cc: *user@flink.apache.org <user@flink.apache.org>
*Subject: *Re: Duplicate copies of job in Flink UI/API
Hi Peter,
Can you provide relevant JobManager logs? And can you write down what
steps have you taken before the failure happened? Did this
failure occur during upgrading Flink, or after the upgrade etc.
Best,
Piotrek
śr., 8 wrz 2021 o 16:11 Peter Westermann <no.westerm...@genesys.com
<mailto:no.westerm...@genesys.com>> napisał(a):
We recently upgraded from Flink 1.12.4 to 1.12.5 and are seeing
some weird behavior after a change in jobmanager leadership: We’re
seeing two copies of the same job, one of those is in SUSPENDED
state and has a start time of zero. Here’s the output from the
/jobs/overview endpoint:
{
"jobs": [{
"jid": "2db4ee6397151a1109d1ca05188a4cbb",
"name": "analytics-flink-v1",
"state": "RUNNING",
"start-time": 1631106146284,
"end-time": -1,
"duration": 2954642,
"last-modification": 1631106152322,
"tasks": {
"total": 112,
"created": 0,
"scheduled": 0,
"deploying": 0,
"running": 112,
"finished": 0,
"canceling": 0,
"canceled": 0,
"failed": 0,
"reconciling": 0
}
}, {
"jid": "2db4ee6397151a1109d1ca05188a4cbb",
"name": "analytics-flink-v1",
"state": "SUSPENDED",
"start-time": 0,
"end-time": -1,
"duration": 1631105900760,
"last-modification": 0,
"tasks": {
"total": 0,
"created": 0,
"scheduled": 0,
"deploying": 0,
"running": 0,
"finished": 0,
"canceling": 0,
"canceled": 0,
"failed": 0,
"reconciling": 0
}
}]
}
Has anyone seen this behavior before?
Thanks,
Peter