GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/310
[FLINK-1376] [runtime] Add proper shared slot release in case of a fatal TaskManager failure

This PR introduces SharedSlot as a special Slot type, so that shared slots are released properly when an Instance has been marked dead. This fixes the problem that a dead instance which has not been shut down properly prevents a job from being removed properly from the system, because it is not aware of the SubSlots.

The PR also adds test cases in which only the heartbeat thread of the TaskManager is killed. Except for the test cases, this is basically the same PR as #309, just rebased on the current 0.8 release candidate.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixSharedSlotReleaseRC2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/310.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #310

----
commit 2935a7ee19eddb48efb38a3a65c4afe5e1bba0d2
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2015-01-12T09:58:45Z

    [FLINK-1376] [runtime] Add proper shared slot release in case of a fatal TaskManager failure.

    Fixes a concurrent modification exception on SharedSlot's subSlots field by synchronizing all state-changing operations through the associated assignment group.

    Fixes a deadlock: Instance.markDead first acquires the instance lock and then, by releasing the associated slots, the assignment group lock. This can block against a direct releaseSlot call on a SharedSlot, which first acquires the assignment group lock and then the instance lock in order to return the slot to the instance.

----
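
The lock-order inversion described in the commit message can be illustrated with a minimal Java sketch. This is not the code from the PR: the classes LockOrderSketch, Group, Instance and SharedSlot below are hypothetical stand-ins, and the approach shown (collecting the slots under the instance lock, then releasing them after that lock has been dropped) is only one common way to make both paths take the group lock before the instance lock. The actual change in the PR may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; these are NOT Flink's real Instance/SharedSlot classes.
final class LockOrderSketch {

    /** Stand-in for the assignment group whose lock guards all sub-slot state. */
    static final class Group {
        final Object lock = new Object();
    }

    static final class Instance {
        private final Object instanceLock = new Object();
        private final List<SharedSlot> allocated = new ArrayList<>();
        private boolean dead;

        SharedSlot allocate(Group group) {
            synchronized (instanceLock) {
                if (dead) {
                    throw new IllegalStateException("instance is dead");
                }
                SharedSlot slot = new SharedSlot(this, group);
                allocated.add(slot);
                return slot;
            }
        }

        void markDead() {
            List<SharedSlot> toRelease;
            synchronized (instanceLock) {
                if (dead) {
                    return;
                }
                dead = true;
                // Copy the slots and leave the instance lock before releasing them.
                toRelease = new ArrayList<>(allocated);
                allocated.clear();
            }
            // No instance lock is held here, so release() can take
            // group lock -> instance lock in the same order as a direct release.
            for (SharedSlot slot : toRelease) {
                slot.release();
            }
        }

        void returnSlot(SharedSlot slot) {
            synchronized (instanceLock) {
                allocated.remove(slot);
            }
        }

        int allocatedCount() {
            synchronized (instanceLock) {
                return allocated.size();
            }
        }
    }

    static final class SharedSlot {
        private final Instance instance;
        private final Group group;
        private final List<String> subSlots = new ArrayList<>();
        private boolean released;

        SharedSlot(Instance instance, Group group) {
            this.instance = instance;
            this.group = group;
        }

        void addSubSlot(String name) {
            synchronized (group.lock) {
                if (released) {
                    throw new IllegalStateException("slot already released");
                }
                subSlots.add(name);
            }
        }

        void release() {
            synchronized (group.lock) {          // first: assignment group lock
                if (released) {
                    return;
                }
                released = true;
                subSlots.clear();
                instance.returnSlot(this);       // second: instance lock
            }
        }

        boolean isReleased() {
            synchronized (group.lock) {
                return released;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Group group = new Group();
        Instance instance = new Instance();
        SharedSlot slot = instance.allocate(group);
        slot.addSubSlot("task-1");

        // Race a fatal-failure path against a direct release of the same slot.
        Thread failure = new Thread(instance::markDead);
        Thread direct = new Thread(slot::release);
        failure.start();
        direct.start();
        failure.join();
        direct.join();

        System.out.println("released=" + slot.isReleased()
                + " allocated=" + instance.allocatedCount());
    }
}
```

In this sketch every path that needs both locks takes the group lock first, so the two threads in main cannot wait on each other in a cycle. All sub-slot mutations also go through the group lock, which mirrors the synchronization described for the subSlots field.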