Hi,
when the JM goes down, it should be brought back up (if configured for HA,
running on k8s, ...), and it should recover all running jobs. If this does
not happen, it means that:
a) either the JM is not in an HA configuration, or
b) it is unable to recover after the failure, which typically means that …
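For reference, a minimal sketch of the relevant flink-conf.yaml entries for
Flink's native Kubernetes HA services (the cluster id and storage path are
placeholders; on newer Flink versions the first key is spelled
high-availability.type: kubernetes):

    kubernetes.cluster-id: my-flink-cluster    # placeholder cluster id
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://my-bucket/flink/ha    # durable storage for JM recovery metadata

The storageDir must point at durable storage (S3, HDFS, ...), because that
is where the restarted JM finds the metadata it needs to recover the jobs.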
I am using the Lyft Flink operator (in k8s), and it is able to monitor the
submitted job status for us. It shows both cluster and job health. The
issue we've seen so far is that sometimes the tasks keep failing and
retrying, but this is not detected by the Flink operator. However, Flink
itself co…
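One way to catch this kind of fail/retry loop directly is to poll the job's
restart counter through the JobManager's REST API. A minimal sketch (the
JobManager address is a placeholder, and on older Flink versions the metric
is named fullRestarts rather than numRestarts):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Polls the job-level numRestarts metric; alert if the value keeps
    // growing between polls.
    public class RestartWatcher {
        public static void main(String[] args) throws Exception {
            String jobManager = "http://flink-jobmanager:8081"; // placeholder address
            String jobId = args[0];                             // job id passed in
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(jobManager + "/jobs/" + jobId
                            + "/metrics?get=numRestarts"))
                    .build();
            // Response looks like: [{"id":"numRestarts","value":"3"}]
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }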
Hi Jan,
Thank you for your response! Apologies that this wasn't clear, but we're
actually looking at what would happen if the job server *were* to go down.
So what we are more interested in is understanding *how* to actually
monitor that the job is running. We won't know the job id, so we can't use …
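One way around not knowing the job id is to list all jobs and match on the
job name. A minimal sketch against the JobManager's /jobs/overview REST
endpoint (the address and expected name are placeholders; a real version
would parse the JSON instead of doing substring checks):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Lists all jobs and crudely checks whether one with the expected name
    // is in the RUNNING state.
    public class JobByNameCheck {
        public static void main(String[] args) throws Exception {
            String jobManager = "http://flink-jobmanager:8081"; // placeholder address
            String expectedName = "my-pipeline";                // placeholder job name
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(jobManager + "/jobs/overview"))
                    .build();
            String body = client.send(request,
                    HttpResponse.BodyHandlers.ofString()).body();
            // Each entry in the response carries "name" and "state"
            // (e.g. RUNNING, FAILED); this check is deliberately crude.
            boolean running = body.contains("\"name\":\"" + expectedName + "\"")
                    && body.contains("\"state\":\"RUNNING\"");
            System.out.println(expectedName + " running: " + running);
        }
    }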
Hi,
if I understand correctly, you have a 'program runner' (sometimes called
a driver), which is supposed to be long-running, watching whether the
submitted Pipeline is running. If it is not, the driver resubmits the
job. If my understanding is correct, I would suggest looking into the
reasons …
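For concreteness, a minimal sketch of such a driver loop with the Beam
Java SDK (buildPipeline() is a placeholder for the actual pipeline
construction, and how waitUntilFinish() reports failures varies by runner):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Submits the pipeline, blocks until it reaches a terminal state,
    // and resubmits it unless it finished normally.
    public class DriverLoop {
        public static void main(String[] args) throws Exception {
            PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
            while (true) {
                Pipeline pipeline = buildPipeline(options);
                PipelineResult result = pipeline.run();
                PipelineResult.State state = result.waitUntilFinish();
                if (state == PipelineResult.State.DONE) {
                    break; // finished normally, stop the driver
                }
                System.err.println("Pipeline ended in state " + state
                        + ", resubmitting...");
                Thread.sleep(10_000); // back off before resubmitting
            }
        }

        static Pipeline buildPipeline(PipelineOptions options) {
            return Pipeline.create(options); // placeholder: add transforms here
        }
    }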