I just got back to this issue. The problem wasn't with the locking; the StreamTask simply wasn't in the running state yet when the first checkpoint trigger message arrived. I actually just saw your JIRA as well, funny... :)
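For illustration, here is a minimal sketch of that race. It is not the actual Flink code: `SketchTask`, `markRunning`, `awaitRunning` and the other names are hypothetical. The idea is that a trigger arriving before the task is running is simply dropped, so a test should wait for the task to be up before it expects the first checkpoint to complete.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a streaming task; this is NOT the real StreamTask API.
public class SketchTask {
    private final Object lock = new Object();
    private final CountDownLatch runningLatch = new CountDownLatch(1);
    private volatile boolean running = false;
    private long completedCheckpoints = 0;
    private long lastCheckpointId = -1;

    // Called once the task has finished its setup and starts processing.
    public void markRunning() {
        running = true;
        runningLatch.countDown();
    }

    // Returns false (and does nothing) if the trigger arrives too early.
    public boolean triggerCheckpoint(long checkpointId) {
        synchronized (lock) {
            if (!running) {
                return false; // trigger dropped: task is not in running state yet
            }
            lastCheckpointId = checkpointId;
            completedCheckpoints++;
            return true;
        }
    }

    // Lets a test block until the task is running before it triggers anything.
    public boolean awaitRunning(long timeoutMillis) {
        try {
            return runningLatch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public long getCompletedCheckpoints() {
        synchronized (lock) {
            return completedCheckpoints;
        }
    }

    public long getLastCheckpointId() {
        synchronized (lock) {
            return lastCheckpointId;
        }
    }
}
```

In a test this would mean calling something like `awaitRunning(...)` first and only then relying on checkpoints being taken.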
Regards,
Gyula

Stephan Ewen <se...@apache.org> wrote (on Fri, Jan 8, 2016, 15:36):

> Hmm, strange issue indeed.
>
> So, checkpoints are definitely triggered (log message by coordinator to
> trigger checkpoint) but are not completing?
> Can you check which is the first checkpoint to complete? Is it Checkpoint
> 1, or a later one (indicating that checkpoint 1 was somehow subsumed)?
>
> Can you check in the stack trace on which lock the checkpoint runnables are
> waiting, and who is holding that lock?
>
> Two thoughts:
>
> 1) What I mistakenly did once in one of my tests is to have the sleep() in
> a downstream task. That would simply prevent the fast generated data
> elements (and the inline checkpoint barriers) from passing through and
> completing the checkpoint.
>
> 2) Is this another issue with the non-fair lock? Does the checkpoint
> runnable simply not get the lock before the checkpoint? Not sure why it
> would suddenly work after the failure. We could try and swap the lock
> Object for a "ReentrantLock(true)" and see what would happen.
>
>
> Stephan
>
>
> On Fri, Jan 8, 2016 at 11:49 AM, Gyula Fóra <gyf...@apache.org> wrote:
>
> > Hey,
> >
> > I have encountered a weird issue in a checkpointing test I am trying to
> > write. The logic is the same as with the previous checkpointing tests,
> > there is a OnceFailingReducer.
> >
> > My problem is that before the reducer fails, my job cannot take any
> > snapshots. The Runnables executing the checkpointing logic in the sources
> > keep waiting on some lock.
> >
> > After the failure and the restart, everything is fine and the
> > checkpointing can succeed properly.
> >
> > Also if I remove the failure from the reducer, the job doesn't take any
> > snapshots (waiting on lock) and the job will finish.
> >
> > Here is the code:
> >
> > https://github.com/gyfora/flink/blob/d1f12c2474413c9af357b6da33f1fac30549fbc3/flink-contrib/flink-streaming-contrib/src/test/java/org/apache/flink/contrib/streaming/state/TfileStateCheckpointingTest.java#L83
> >
> > I assume there is no problem with the source as the Thread.sleep(..) is
> > outside of the synchronized block. (and as I said after the failure it
> > works fine).
> >
> > Any ideas?
> >
> > Thanks,
> > Gyula
> >
>
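For reference, the lock swap suggested in point 2) of the quoted message could look roughly like the sketch below. Only `ReentrantLock` and its fairness constructor are real JDK API; `FairLockSource`, `emitOne` and `snapshot` are hypothetical names, and this is not how the test or the Flink sources are actually written. It just shows a fair lock guarding both the emitting path and the checkpointing path, so that a waiting checkpoint thread is not starved by a source that keeps re-acquiring the lock.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical source; only ReentrantLock is a real JDK class here.
public class FairLockSource {
    // 'true' requests a fair lock, which favors the longest-waiting thread.
    private final ReentrantLock checkpointLock = new ReentrantLock(true);
    private long emitted = 0;

    // Emitting path: updates state under the lock.
    public void emitOne() {
        checkpointLock.lock();
        try {
            emitted++;
        } finally {
            checkpointLock.unlock();
        }
    }

    // Checkpointing path: reads a consistent view of the state under the same lock.
    public long snapshot() {
        checkpointLock.lock();
        try {
            return emitted;
        } finally {
            checkpointLock.unlock();
        }
    }

    public boolean isFair() {
        return checkpointLock.isFair();
    }
}
```

Any sleep between elements would stay outside the locked section, as in the original test.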