Till Rohrmann created FLINK-1489:
------------------------------------

             Summary: Failing JobManager due to blocking calls in Execution.scheduleOrUpdateConsumers
                 Key: FLINK-1489
                 URL: https://issues.apache.org/jira/browse/FLINK-1489
             Project: Flink
          Issue Type: Bug
            Reporter: Till Rohrmann
            Assignee: Till Rohrmann


[~Zentol] reported that the JobManager failed to execute his Python job. The 
reason is that the JobManager executes blocking calls in the actor thread, in 
the method {{Execution.sendUpdateTaskRpcCall}}, in response to receiving a 
{{ScheduleOrUpdateConsumers}} message.

Every TaskManager may send a {{ScheduleOrUpdateConsumers}} message to the 
JobManager to notify the consumers about available data. The JobManager then 
sends the respective update call {{Execution.sendUpdateTaskRpcCall}} to each 
TaskManager. Because the actor thread blocks on each of these calls, the update 
calls are effectively executed sequentially. Due to the ever-accumulating 
delay, some of the initial calls to 
{{IntermediateResultPartition.scheduleOrUpdateConsumers}} on the TaskManager 
side time out. As a result, the execution of the respective Tasks fails.
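To illustrate why the delay accumulates, here is a minimal Java sketch. It assumes the actor can be modeled as a single-threaded executor, and it replaces the update RPC with a hypothetical {{blockingUpdateCall}} that simply sleeps for a fixed latency. Neither is Flink code. Because every handler blocks the only thread, the n-th message is handled only after roughly n times the latency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BlockingActorDemo {

    // Hypothetical stand-in for a synchronous update RPC: the caller waits
    // `latencyMs` for the reply.
    static void blockingUpdateCall(long latencyMs) {
        try {
            Thread.sleep(latencyMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Models the actor as a single-threaded executor (an assumption): all message
    // handlers run on one thread, one after another. Returns, for each of `n`
    // messages, the elapsed time (ms since start) at which its handler finished.
    static long[] run(int n, long latencyMs) {
        ExecutorService actorThread = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();
        List<CompletableFuture<Long>> done = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            done.add(CompletableFuture.supplyAsync(() -> {
                blockingUpdateCall(latencyMs); // the actor thread waits here
                return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            }, actorThread));
        }
        long[] finishedAt = new long[n];
        for (int i = 0; i < n; i++) {
            finishedAt[i] = done.get(i).join();
        }
        actorThread.shutdown();
        return finishedAt;
    }

    public static void main(String[] args) {
        long[] t = run(5, 100);
        for (int i = 0; i < t.length; i++) {
            System.out.printf("message %d handled after ~%d ms%n", i, t[i]);
        }
    }
}
```

With five messages and a 100 ms latency, the last handler finishes only after about 500 ms. A caller-side timeout shorter than that would fire for the later messages even though each individual call is fast.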

A solution would be to make the call non-blocking.
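A minimal sketch of the non-blocking alternative, under the same assumptions as above. A hypothetical {{asyncUpdateCall}} returns a {{CompletableFuture}} immediately, and the actor thread only starts the call and attaches a callback instead of waiting for the reply. This illustrates the general pattern; it is not the actual fix for this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class NonBlockingActorDemo {

    // Hypothetical asynchronous update RPC: returns at once with a future that
    // completes after `latencyMs`, simulated on a separate "network" pool.
    static CompletableFuture<Void> asyncUpdateCall(long latencyMs, ExecutorService network) {
        return CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(latencyMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, network);
    }

    // Returns, for each of `n` messages, the elapsed time (ms since start) at
    // which the reply to its update call was processed.
    static long[] run(int n, long latencyMs) {
        ExecutorService actorThread = Executors.newSingleThreadExecutor();
        ExecutorService network = Executors.newFixedThreadPool(n);
        long start = System.nanoTime();
        List<CompletableFuture<Long>> done = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            done.add(CompletableFuture
                // Runs on the actor thread, but only starts the RPC and returns.
                .supplyAsync(() -> asyncUpdateCall(latencyMs, network), actorThread)
                // Unwrap the nested future so we continue when the reply arrives.
                .thenCompose(rpc -> rpc)
                // Callback on reply; the actor thread was never blocked waiting.
                .thenApply(ignored -> TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)));
        }
        long[] finishedAt = new long[n];
        for (int i = 0; i < n; i++) {
            finishedAt[i] = done.get(i).join();
        }
        actorThread.shutdown();
        network.shutdown();
        return finishedAt;
    }

    public static void main(String[] args) {
        long[] t = run(5, 100);
        for (int i = 0; i < t.length; i++) {
            System.out.printf("message %d handled after ~%d ms%n", i, t[i]);
        }
    }
}
```

Here the five calls overlap, so all of them complete after roughly one latency instead of five. In a real actor system, the callback would typically send a message back to the actor rather than touch the actor's state directly, so that all state changes still happen on the actor thread.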

A general caveat for actor programming is that we should never block the actor 
thread. Otherwise we seriously jeopardize the scalability of the system or, 
even worse, the system simply fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
