[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xudong Cao resolved HDFS-15069.
-------------------------------
    Resolution: Duplicate

> DecommissionMonitor-0 thread will block forever while its timer task
> scheduled encountered any unchecked exceptions.
> ---------------------------------------------------------------------
>
>                 Key: HDFS-15069
>                 URL: https://issues.apache.org/jira/browse/HDFS-15069
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.1.3
>            Reporter: Xudong Cao
>            Assignee: Xudong Cao
>            Priority: Major
>         Attachments: stack_on_16_12.png, stack_on_16_42.png
>
>
> More than once, we have observed that while a large number of DNs are being
> decommissioned, the thread DecommissionMonitor-0 stops being scheduled and
> blocks for a long time, with no exception logs or notifications at all.
> For example, we recently decommissioned 65 DNs at the same time, each holding
> about 10 TB, and the DecommissionMonitor-0 thread blocked for about 15 days.
> The stack of DecommissionMonitor-0 looks like this:
> # stack on 2019.12.17 16:12 !stack_on_16_12.png!
> # stack on 2019.12.17 16:42 !stack_on_16_42.png!
> Over that half hour the thread was not scheduled at all: its Waited count
> did not change.
> We think the cause of the problem is:
> # The DecommissionMonitor task submitted by the NameNode encounters an
> unchecked exception while running, after which the task is never executed
> again.
> # The NameNode does not keep the ScheduledFuture of this task and never calls
> ScheduledFuture.get(), so the unchecked exception thrown by the task stays
> stored in the future and nobody ever sees it.
> After that, the following happens:
> # The ScheduledExecutorService thread DecommissionMonitor-0 blocks forever
> in ThreadPoolExecutor.getTask().
> # The previously submitted DecommissionMonitor task is never executed
> again.
> # No logs or notifications tell us exactly what happened.
> Possible solutions:
> 1. Do not use a thread pool to execute the decommission monitor task;
> instead, introduce a separate thread for it, just like HeartbeatManager,
> ReplicationMonitor, LeaseManager, BlockReportThread, and so on.
> OR
> 2. Catch all exceptions in the decommission monitor task's run() method,
> so that it never throws.
> I prefer the second option.
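
For illustration only, here is a minimal sketch of the behaviour described in the report and of the second option. It is not Hadoop code: MonitorDemo, flakyTask, guarded and countRuns are hypothetical names, and the task is a stand-in that deliberately fails on its second run. It relies only on the documented behaviour of scheduleAtFixedRate, which suppresses all later executions once one execution throws.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MonitorDemo {

    /** Hypothetical stand-in for the monitor task: throws on its second run. */
    static Runnable flakyTask(AtomicInteger runs) {
        return () -> {
            if (runs.incrementAndGet() == 2) {
                throw new IllegalStateException("simulated unchecked exception");
            }
        };
    }

    /**
     * Option 2 from the report: catch exceptions inside run() so the periodic
     * schedule survives. Errors (e.g. OutOfMemoryError) still propagate here.
     */
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Exception e) {
                // Log and carry on; the next scheduled run will retry.
                System.err.println("monitor task failed, will retry: " + e);
            }
        };
    }

    /** Schedules the task every 10 ms for about 300 ms and returns the run count. */
    static int countRuns(boolean guard) {
        AtomicInteger runs = new AtomicInteger();
        Runnable task = guard ? guarded(flakyTask(runs)) : flakyTask(runs);
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            pool.scheduleAtFixedRate(task, 0, 10, TimeUnit.MILLISECONDS);
            Thread.sleep(300);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded runs: " + countRuns(false));
        System.out.println("guarded runs:   " + countRuns(true));
    }
}
```

With the unguarded task the count should stop at 2, and nothing is printed about the failure because the exception is only held in the ignored ScheduledFuture. With the guarded task the failure is logged and the count keeps growing for the whole interval.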