Re: [1.7.1] job stuck in suspended state

2019-03-06 Thread Till Rohrmann
Hi Steven, I think I found the problem. It is caused by a JobMaster that takes a long time to suspend the job, combined with multiple leader changes. After the first leadership revocation and regain, the Dispatcher recovers the submitted job but waits to execute it because the JobMas

Re: [1.7.1] job stuck in suspended state

2019-03-06 Thread Till Rohrmann
Hi Steven, a quick update from my side after looking through the logs. The problem seems to be that the Dispatcher does not start recovering the jobs once it regains leadership after having lost it. I cannot yet tell why this is happening, and I will try to debug the problem further. If you mana

Re: [1.7.1] job stuck in suspended state

2019-03-04 Thread Steven Wu
Till, I will send you the complete log offline. We don't know how to reliably reproduce the problem, but it did happen quite frequently, about once every couple of days. Let me see if I can cherry-pick the fix/commit to the 1.7 branch. Thanks, Steven On Mon, Mar 4, 2019 at 5:55 AM Till Rohrmann w

Re: [1.7.1] job stuck in suspended state

2019-03-04 Thread Till Rohrmann
Hi Steven, is this the tail of the logs, or are there other statements following? I think your problem could indeed be related to FLINK-11537. Is it possible to somehow reliably reproduce this problem? If yes, then you could try out the RC for Flink 1.8.0, which should be published in the next few days.

[1.7.1] job stuck in suspended state

2019-03-01 Thread Steven Wu
We have observed that a job sometimes gets stuck in the suspended state, and no job restart/recovery is attempted once the job is suspended. * it is a high-parallelism job (close to 2,000) * there were a few job restarts before this * there were long GC pauses during the period * ZooKeeper timeout. probably c