[ https://issues.apache.org/jira/browse/FLINK-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010312#comment-17010312 ]
Zhu Zhu commented on FLINK-15448:
---------------------------------

I just hit another case where one cannot even find the host of a pending/failed TM in the logs (FLINK-15499). So I think it would be helpful to print the host of a TM in more places than just the task deployment stage.

Composing the host info into the ResourceID looks to me like a better design than passing the host around alongside the ResourceID. There are 2 possible drawbacks, though:
1. redundant log content
2. the ResourceID size would double, and the size of certain RPCs (like heartbeats) may increase. This is a common issue for any effort to attach meaningful information to other IDs, such as ExecutionAttemptID and IntermediateResultPartitionID.

For ResourceID, these 2 drawbacks should not be critical.

[~trohrmann] shall we replace ResourceID with an extended class like TaskManagerID? I think using one general ResourceID for TM/RM/JM alike makes it not that intuitive in development. With an extended class, we can also limit the change to that class for the moment.

> Log host informations for TaskManager failures.
> -----------------------------------------------
>
>                 Key: FLINK-15448
>                 URL: https://issues.apache.org/jira/browse/FLINK-15448
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.1
>            Reporter: Victor Wong
>            Assignee: Victor Wong
>            Priority: Minor
>
> With Flink on Yarn, sometimes we ran into an exception like this:
> {code:java}
> java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id container_xxxx timed out.
> {code}
> To find the host of the lost TaskManager and log into it for more details,
> we have to check the previous logs for the host information, which is a
> little time-consuming.
> Maybe we can add more descriptive information to the ResourceID of Yarn
> containers, e.g. "container_xxx@host_name:port_number".
> Here's the demo:
> {code:java}
> class ResourceID {
>   final String resourceId;
>   final String details;
>
>   public ResourceID(String resourceId) {
>     this.resourceId = resourceId;
>     this.details = resourceId;
>   }
>
>   public ResourceID(String resourceId, String details) {
>     this.resourceId = resourceId;
>     this.details = details;
>   }
>
>   public String toString() {
>     return details;
>   }
> }
>
> // in flink-yarn
> private void startTaskExecutorInContainer(Container container) {
>   final String containerIdStr = container.getId().toString();
>   final String containerDetail = container.getId() + "@" + container.getNodeId();
>   final ResourceID resourceId = new ResourceID(containerIdStr, containerDetail);
>   ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
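
For illustration only, the "extended class like TaskManagerID" idea from the comment could look roughly like the sketch below. Everything in it is an assumption rather than Flink code: ResourceID here is a minimal stand-in for the real class, and the host/port fields, the constructor, and the toString() format are hypothetical. The format simply mirrors the "container_xxx@host_name:port_number" example from the issue description.

```java
import java.util.Objects;

/** Minimal stand-in for Flink's ResourceID, reduced to the ID string (assumption). */
class ResourceID {
    private final String resourceId;

    ResourceID(String resourceId) {
        this.resourceId = Objects.requireNonNull(resourceId);
    }

    String getResourceIdString() {
        return resourceId;
    }

    // Identity depends on the ID string only.
    @Override
    public boolean equals(Object o) {
        return o instanceof ResourceID && resourceId.equals(((ResourceID) o).resourceId);
    }

    @Override
    public int hashCode() {
        return resourceId.hashCode();
    }

    @Override
    public String toString() {
        return resourceId;
    }
}

/** Hypothetical extended ID that carries the TaskManager's location for logging. */
final class TaskManagerID extends ResourceID {
    private final String host; // hypothetical field
    private final int port;    // hypothetical field

    TaskManagerID(String resourceId, String host, int port) {
        super(resourceId);
        this.host = Objects.requireNonNull(host);
        this.port = port;
    }

    String getHost() {
        return host;
    }

    int getPort() {
        return port;
    }

    // equals/hashCode are inherited on purpose: host and port are descriptive
    // only, so lookups keyed by the plain resource ID keep working.
    @Override
    public String toString() {
        return getResourceIdString() + "@" + host + ":" + port;
    }
}
```

In this sketch the host only affects toString(), so a message such as the heartbeat timeout above would print "container_xxxx@host_name:port" while equality and hashing still depend on the container ID alone. Whether the extra fields should also be serialized into RPCs such as heartbeats is the size trade-off raised in the comment, and the sketch does not settle it.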