Hey, I have encountered a weird issue in a checkpointing test I am trying to write. The logic is the same as in the previous checkpointing tests: there is a OnceFailingReducer.
My problem is that before the reducer fails, my job cannot take any snapshots. The Runnables executing the checkpointing logic in the sources keep waiting on some lock. After the failure and the restart, everything is fine and checkpointing succeeds properly. Also, if I remove the failure from the reducer, the job doesn't take any snapshots (still waiting on the lock) and then simply finishes.
Here is the code: https://github.com/gyfora/flink/blob/d1f12c2474413c9af357b6da33f1fac30549fbc3/flink-contrib/flink-streaming-contrib/src/test/java/org/apache/flink/contrib/streaming/state/TfileStateCheckpointingTest.java#L83
I assume there is no problem with the source, as the Thread.sleep(..) is outside of the synchronized block (and, as I said, after the failure it works fine).
Any ideas?
Thanks,
Gyula
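P.S. For reference, here is a minimal standalone sketch of the locking pattern I am describing for the source. It is not the actual test code and uses no Flink classes: CheckpointLockSketch, checkpointLock, runSource and snapshot are made-up names for illustration only. The point is just that the source holds the lock only while updating its state and sleeps outside of it, so a checkpointing thread taking the same lock should not be starved by the source alone.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a source plus a checkpointing Runnable sharing one lock.
class CheckpointLockSketch {
    // Assumed shared lock; in a real job this would be the lock the runtime hands to the source.
    private final Object checkpointLock = new Object();
    private long emitted = 0;

    // Source loop: update state under the lock, sleep OUTSIDE the synchronized block.
    void runSource(int records, long sleepMillis) {
        for (int i = 0; i < records; i++) {
            synchronized (checkpointLock) {
                emitted++;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop emitting.
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Checkpoint logic: take the same lock and read a consistent view of the state.
    long snapshot() {
        synchronized (checkpointLock) {
            return emitted;
        }
    }

    // Returns true if a snapshot finished within the timeout while the source was running.
    static boolean snapshotCompletesWhileSourceRuns() {
        CheckpointLockSketch sketch = new CheckpointLockSketch();
        CountDownLatch snapshotDone = new CountDownLatch(1);

        Thread source = new Thread(() -> sketch.runSource(50, 10));
        Thread checkpointer = new Thread(() -> {
            sketch.snapshot();
            snapshotDone.countDown();
        });

        source.start();
        checkpointer.start();
        try {
            boolean completed = snapshotDone.await(2, TimeUnit.SECONDS);
            source.interrupt();
            source.join();
            checkpointer.join();
            return completed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("snapshot completed: " + snapshotCompletesWhileSourceRuns());
    }
}
```

If the snapshot thread in a sketch like this still blocked, something other than the source loop would have to be holding the lock, which is why I suspect the issue is elsewhere.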