Hi Steven,
I think I found the problem. It is caused by a JobMaster that takes a long
time to suspend the job, combined with multiple leader changes. What happens
after leadership is first revoked and then regained is that the Dispatcher
recovers the submitted job but waits to execute it because the JobMas
Hi Steven,
a quick update from my side after looking through the logs. The problem
seems to be that the Dispatcher does not start recovering the jobs when it
regains leadership after having lost it. I cannot yet tell why this is
happening, and I will try to debug the problem further.
If you mana
Till,
I will send you the complete log offline. We don't know how to reliably
reproduce the problem, but it did happen quite frequently, about once every
couple of days. Let me see if I can cherry-pick the fix/commit to the 1.7
branch.
Thanks,
Steven
On Mon, Mar 4, 2019 at 5:55 AM Till Rohrmann w
Hi Steven,
is this the tail of the logs or are there other statements following? I
think your problem could indeed be related to FLINK-11537. Is it possible
to somehow reliably reproduce this problem? If yes, then you could try out
the RC for Flink 1.8.0, which should be published in the next few days.
We have observed that a job sometimes gets stuck in the suspended state, and
no job restart or recovery is attempted once the job is suspended.
* it is a high-parallelism job (like close to 2,000)
* there were a few job restarts before this
* there were long GC pauses during the period
* zookeeper timeout. probably c