[ 
https://issues.apache.org/jira/browse/FLINK-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123284#comment-15123284
 ] 

Maximilian Michels commented on FLINK-3300:
-------------------------------------------

Thanks for reporting. A quick fix would be to initialize the 
ContainerLaunchContext before bringing up the AsyncClient. Reverting is IMHO 
not an option, because the old implementation could also cause problems, such 
as blocking the actor while containers are being allocated.

+1 for allocating/removing containers in the YarnJobManager actor and sending 
messages from the AsyncClient to trigger that.
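
To illustrate the message-passing approach, here is a minimal sketch in plain 
Java. It is not the actual YarnJobManager code: the class MailboxActorSketch, 
the message types and the counters are all hypothetical, and a single thread 
draining a queue stands in for the actor. Callback threads only enqueue 
messages, and all state changes happen on the actor thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical stand-in for an actor: one thread drains a mailbox. */
final class MailboxActorSketch {

    // Hypothetical message types, not Flink's actual messages.
    interface Message {}

    static final class StartSession implements Message {
        final String launchContext;
        StartSession(String launchContext) { this.launchContext = launchContext; }
    }

    static final class ContainersAllocated implements Message {
        final int count;
        ContainersAllocated(int count) { this.count = count; }
    }

    static final class Stop implements Message {}

    private final BlockingQueue<Message> mailbox = new LinkedBlockingQueue<>();

    // Mutated only by the actor thread, so no locking is needed.
    private String launchContext;   // stays null until StartSession is handled
    private int pendingContainers;  // allocated before the context was set
    private int startedContainers;

    /** Called from any thread, e.g. a resource manager callback thread. */
    void tell(Message message) {
        mailbox.add(message);
    }

    /** Actor loop: handles one message at a time until Stop arrives. */
    void run() {
        try {
            while (true) {
                Message m = mailbox.take();
                if (m instanceof Stop) {
                    return;
                } else if (m instanceof StartSession) {
                    launchContext = ((StartSession) m).launchContext;
                    // Containers that arrived too early can be started now.
                    startedContainers += pendingContainers;
                    pendingContainers = 0;
                } else if (m instanceof ContainersAllocated) {
                    int n = ((ContainersAllocated) m).count;
                    if (launchContext == null) {
                        pendingContainers += n;  // defer instead of failing
                    } else {
                        startedContainers += n;
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Runs concurrent callbacks against the actor; returns containers started. */
    static int runDemo(int callbackThreads, int callbacksPerThread)
            throws InterruptedException {
        MailboxActorSketch actor = new MailboxActorSketch();
        Thread actorThread = new Thread(actor::run);
        actorThread.start();

        List<Thread> callbacks = new ArrayList<>();
        for (int i = 0; i < callbackThreads; i++) {
            Thread t = new Thread(() -> {
                for (int j = 0; j < callbacksPerThread; j++) {
                    actor.tell(new ContainersAllocated(1));
                }
            });
            callbacks.add(t);
            t.start();
        }
        // The session start races with the callbacks on purpose.
        actor.tell(new StartSession("launch-context"));

        for (Thread t : callbacks) {
            t.join();
        }
        actor.tell(new Stop());  // enqueued after every other message
        actorThread.join();      // join makes the actor's state visible here
        return actor.startedContainers;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo(4, 25));  // prints 100
    }
}
```

In this sketch, a callback that arrives before the launch context is set is 
deferred rather than acted on, so the ordering between session start and 
container allocation no longer matters.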

> Concurrency Bug in Yarn JobManager
> ----------------------------------
>
>                 Key: FLINK-3300
>                 URL: https://issues.apache.org/jira/browse/FLINK-3300
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.0.0
>            Reporter: Stephan Ewen
>            Assignee: Maximilian Michels
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> The change to use the async ResourceManager client introduced concurrency 
> problems: The ResourceManager callback threads run and change data structures 
> at the same time as the actor methods, voiding the actor concurrency model.
> One example that can happen is that the callback tries to start containers 
> while the ContainerLaunchContext is still not set (because the actor method 
> is still in the StartYarnSession method).
> Bug introducing commit: 
> https://github.com/apache/flink/commit/4e52fe4304566e5239996b3d48290e0c1f0772e8
> Quick fix could be to revert the commit. Better solution would be to let the 
> callback methods send actor messages to the JobManager, rather than directly 
> acting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
