[ 
https://issues.apache.org/jira/browse/FLINK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430870#comment-15430870
 ] 

ASF GitHub Bot commented on FLINK-4348:
---------------------------------------

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/2389
  
    Thanks for your contribution @beyond1920 :-) I've reviewed the PR and I 
think it would be good to split it up into several parts. The first part 
could be the heartbeat logic, i.e. a `HeartbeatManager`. Here we could try to 
implement a generic sending and receiving end. I think the implementation can 
be almost independent of the JM, RM and TE implementations (similar to 
`RetryingRegistration`). This will allow us to easily test this component.
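    
    As a minimal sketch of what such a generic component could look like: all 
names below (`HeartbeatSketch`, `HeartbeatSender`, `HeartbeatListener`, 
`HeartbeatMonitor`) are hypothetical and are taken neither from the PR nor from 
Flink's code base. The target type is a type parameter and the current time is 
passed in explicitly, so nothing here depends on the JM, RM or TE.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not Flink's actual HeartbeatManager API. */
final class HeartbeatSketch {

    /** Sending end: delivers one heartbeat to a target of arbitrary type. */
    interface HeartbeatSender<T> {
        void sendHeartbeat(T target);
    }

    /** Callback invoked by the receiving end when a target timed out. */
    interface HeartbeatListener<T> {
        void onTimeout(T target);
    }

    /** Sends one heartbeat to every given target. */
    static <T> void sendToAll(Iterable<T> targets, HeartbeatSender<T> sender) {
        for (T target : targets) {
            sender.sendHeartbeat(target);
        }
    }

    /** Receiving end: remembers when each target was last heard from. */
    static final class HeartbeatMonitor<T> {
        private final long timeoutMillis;
        private final HeartbeatListener<T> listener;
        private final Map<T, Long> lastSeen = new HashMap<>();

        HeartbeatMonitor(long timeoutMillis, HeartbeatListener<T> listener) {
            this.timeoutMillis = timeoutMillis;
            this.listener = listener;
        }

        /** Records a heartbeat from the target at the given time. */
        void receiveHeartbeat(T target, long nowMillis) {
            lastSeen.put(target, nowMillis);
        }

        /** Drops and reports every target silent for longer than the timeout. */
        void checkTimeouts(long nowMillis) {
            List<T> expired = new ArrayList<>();
            for (Map.Entry<T, Long> entry : lastSeen.entrySet()) {
                if (nowMillis - entry.getValue() > timeoutMillis) {
                    expired.add(entry.getKey());
                }
            }
            // Remove after iterating to avoid modifying the map mid-iteration.
            for (T target : expired) {
                lastSeen.remove(target);
                listener.onTimeout(target);
            }
        }

        boolean isMonitored(T target) {
            return lastSeen.containsKey(target);
        }
    }
}
```

    Because the clock is an argument rather than a timer thread, a unit test 
can drive `receiveHeartbeat` and `checkTimeouts` deterministically. A real 
implementation would presumably schedule both ends on an executor instead.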
    
    The next step would be the integration of this component into the RM, JM 
and TE.
    
    Concerning the slot request logic, I think we should wait a little bit for 
the `SlotManager` implementation. It could be the case that the `SlotManager` 
will make the RPCs to the `TaskExecutor` and not the RM. But for the moment the 
interface is, afaik, not specified well enough to program against it.
    
    The failure notification should also be treated in a separate PR imo. The 
notification can have multiple origins (e.g. the `HeartbeatManager` or the 
resource management framework) and should be designed to accommodate that.
    
    In general, I think the components should be more thoroughly tested with 
more fine-grained unit tests. Furthermore, I think it would be good if we could 
revise the code documentation a little bit.


> Implement communication from ResourceManager to TaskManager
> -----------------------------------------------------------
>
>                 Key: FLINK-4348
>                 URL: https://issues.apache.org/jira/browse/FLINK-4348
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Cluster Management
>            Reporter: Kurt Young
>            Assignee: zhangjing
>
> There are three main interactions initiated from the RM to the TM:
> * Heartbeat: the RM uses heartbeats to sync with the TM's slot status.
> * SlotRequest: when the RM decides to assign a slot to a JM, it should first 
> send a slot request to the TM. The TM can either accept or reject this request.
> * FailureNotify: in some corner cases, the TM will be marked as invalid by the 
> cluster manager master (e.g. the YARN master) without the TM itself realizing 
> it. The RM should then send a failure notification to the TM so that the TM 
> can terminate itself.



