Ok, looks like we've found the cause of this issue. The scenario looks like
this:
1. The queue is full (let's assume that its capacity is N elements)
2. There is some pending element waiting, so the
pendingStreamElementQueueEntry field in AsyncWaitOperator is not null and
while-loop in addAsyncBufferEntry method is trying to add this element to
the queue (but element is not added because queue is full)
3. Now the snapshot is taken - the whole queue of N elements is being
written into the ListState in snapshotState method and also (what is more
important) this pendingStreamElementQueueEntry is written to this list too. 
4. The process is being restarted, so it tries to recover all the elements
and put them again into the queue, but the list of recovered elements hold
N+1 element and our queue capacity is only N. Process is not started yet, so
it can not process any element and this one element is waiting endlessly.
But it's never added and the process will never process anything. Deadlock.
5. Trigger is fired and indeed discarded because the process is not running
yet.

If something is unclear in my description - please let me know. We will also
try to reproduce this bug in some unit test and then report Jira issue.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Reply via email to