I just got back to this issue. The problem wasn't with the locking; rather,
the StreamTask wasn't yet in the running state when the first checkpoint
trigger message arrived.
I actually just saw your JIRA as well, funny... :)
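
To illustrate the kind of race described above, here is a minimal sketch. It is not Flink's StreamTask code: the class `RunningStateSketch` and its methods are hypothetical, and it assumes that a trigger arriving before the task is running is simply declined.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a task; NOT Flink's actual StreamTask.
final class RunningStateSketch {
    private final AtomicBoolean running = new AtomicBoolean(false);
    private int completed;

    // Marks the task as running; triggers are only honored after this call.
    void start() {
        running.set(true);
    }

    // Returns true if the checkpoint was performed, false if it was declined.
    // Assumption: triggers that arrive before start() are dropped, not queued.
    synchronized boolean triggerCheckpoint(long checkpointId) {
        if (!running.get()) {
            return false;
        }
        completed++;
        return true;
    }

    synchronized int completedCheckpoints() {
        return completed;
    }

    public static void main(String[] args) {
        RunningStateSketch task = new RunningStateSketch();
        System.out.println("before start: " + task.triggerCheckpoint(1L));
        task.start();
        System.out.println("after start: " + task.triggerCheckpoint(2L));
    }
}
```

In a sketch like this, checkpoint 1 never completes on the task side even though the coordinator logged the trigger, which would match the symptom of seeing trigger messages without completions.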

Regards,
Gyula

Stephan Ewen <se...@apache.org> wrote (on Fri, Jan 8, 2016, 15:36):

> Hmm, strange issue indeed.
>
> So, checkpoints are definitely triggered (log message by coordinator to
> trigger checkpoint) but are not completing?
> Can you check which is the first checkpoint to complete? Is it Checkpoint
> 1, or a later one (indicating that checkpoint 1 was somehow subsumed).
>
> Can you check in the stack trace on which lock the checkpoint runnables are
> waiting, and who is holding that lock?
>
> Two thoughts:
>
> 1) What I mistakenly did once in one of my tests was to put the sleep() in
> a downstream task. That simply prevents the quickly generated data
> elements (and the inline checkpoint barriers) from passing through and
> completing the checkpoint.
>
> 2) Is this another issue with the non-fair lock? Does the checkpoint
> runnable simply not get the lock before the checkpoint? Not sure why it
> would suddenly work after the failure. We could try swapping the lock
> Object for a "ReentrantLock(true)" and see what happens.
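
For reference, the lock swap suggested above could look roughly like the sketch below. The class `FairLockSketch` and its methods are hypothetical and only stand in for the source and checkpoint code paths; the one real API used is `java.util.concurrent.locks.ReentrantLock`, whose `ReentrantLock(true)` constructor requests a fair ordering policy that favors the longest-waiting thread.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a source that shares one lock with the checkpoint path.
final class FairLockSketch {
    // 'true' requests the fair policy: under contention the longest-waiting thread is favored.
    private final ReentrantLock lock = new ReentrantLock(true);
    private long emitted;

    // Called by the emitting thread for every element.
    void emit() {
        lock.lock();
        try {
            emitted++;
        } finally {
            lock.unlock();
        }
    }

    // Called by the checkpoint runnable; returns the state seen under the lock.
    long snapshot() {
        lock.lock();
        try {
            return emitted;
        } finally {
            lock.unlock();
        }
    }

    boolean isFair() {
        return lock.isFair();
    }

    public static void main(String[] args) {
        FairLockSketch source = new FairLockSketch();
        source.emit();
        System.out.println("fair=" + source.isFair() + ", snapshot=" + source.snapshot());
    }
}
```

A plain `synchronized` block gives no ordering guarantee among waiting threads, so a tight emit loop can in principle keep reacquiring the monitor. A fair lock trades some throughput for avoiding that starvation.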
>
>
> Stephan
>
>
> On Fri, Jan 8, 2016 at 11:49 AM, Gyula Fóra <gyf...@apache.org> wrote:
>
> > Hey,
> >
> > I have encountered a weird issue in a checkpointing test I am trying to
> > write. The logic is the same as with the previous checkpointing tests,
> > there is a OnceFailingReducer.
> >
> > My problem is that before the reducer fails, my job cannot take any
> > snapshots. The Runnables executing the checkpointing logic in the sources
> > keep waiting on some lock.
> >
> > After the failure and the restart, everything is fine and the
> > checkpointing can succeed properly.
> >
> > Also, if I remove the failure from the reducer, the job doesn't take any
> > snapshots (waiting on lock) and the job will finish.
> >
> > Here is the code:
> >
> >
> > https://github.com/gyfora/flink/blob/d1f12c2474413c9af357b6da33f1fac30549fbc3/flink-contrib/flink-streaming-contrib/src/test/java/org/apache/flink/contrib/streaming/state/TfileStateCheckpointingTest.java#L83
> >
> > I assume there is no problem with the source as the Thread.sleep(..) is
> > outside of the synchronized block. (and as I said after the failure it
> > works fine).
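
The source pattern described above (sleep outside the synchronized block) can be sketched as follows. This is not the code from the linked test: `ThrottledSourceSketch`, `checkpointLock`, and `offset` are hypothetical names used only for illustration.

```java
// Hypothetical throttled source; NOT the source from the linked test.
final class ThrottledSourceSketch {
    private final Object checkpointLock = new Object();
    private long offset;

    // Emits 'count' elements, holding the lock only while state is updated.
    void run(int count, long sleepMillis) {
        for (int i = 0; i < count; i++) {
            synchronized (checkpointLock) {
                offset++; // state update and emission happen under the lock
            }
            try {
                // Sleeping outside the lock leaves a window for the checkpoint thread.
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Called by the checkpoint thread; reads state under the same lock.
    long snapshotOffset() {
        synchronized (checkpointLock) {
            return offset;
        }
    }

    public static void main(String[] args) {
        ThrottledSourceSketch source = new ThrottledSourceSketch();
        source.run(3, 1L);
        System.out.println("offset=" + source.snapshotOffset());
    }
}
```

Had the sleep been inside the synchronized block, the checkpoint thread would only get a chance at the lock in the short gap between iterations.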
> >
> > Any ideas?
> >
> > Thanks,
> > Gyula
> >
>
