[ https://issues.apache.org/jira/browse/HDFS-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaoqiao He resolved HDFS-17250. -------------------------------- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > EditLogTailer#triggerActiveLogRoll should handle thread Interrupted > ------------------------------------------------------------------- > > Key: HDFS-17250 > URL: https://issues.apache.org/jira/browse/HDFS-17250 > Project: Hadoop HDFS > Issue Type: Improvement > Reporter: Haiyang Hu > Assignee: Haiyang Hu > Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > *Issue:* > When the NameNode attempts to trigger a log roll and the cachedActiveProxy is > a "the machine has been shut down of the namenode," it is unable to establish > a network connection. This results in a timeout during the socket connection > phase, which has a set timeout of 90 seconds. Since the asynchronous call for > "Triggering log roll" has a waiting time of 60 seconds, it triggers a timeout > and initiates a "cancel" operation, causing the executing thread to receive > an "Interrupted" signal and throwing a "java.io.InterruptedIOException" > exception. > Currently, the logic not to handle interrupted signal, and the > "getActiveNodeProxy" method hasn't reached the maximum retry limit, the > overall execution process doesn't exit and it continues to attempt to > call the "rollEditLog" on the next NameNode in the list. However when a > socket connection is established, it throws a > "java.nio.channels.ClosedByInterruptException" exception due to the thread > being in an "Interrupted" state. > this cycle repeats until it reaches the maximum retry limit (nnCount * > maxRetries) will exits. > However in the next cycle of "Triggering log roll," it continues to traverse > the NameNode list and encounters the same issue and the cachedActiveProxy is > still a "shut down NameNode." > This eventually results in the NameNode being unable to successfully complete > the "Triggering log roll" operation. > To optimize this, we need to handle the thread being interrupted and exit the > execution. > *Detailed logs such as:* > the Observer node "ob1" will execute "Triggering log roll" is as follows: > nns list is [ob2(shut down machine),nn1(active),nn2(standy)] > * The Observer node "ob1" periodically executes "triggerActiveLogRoll" and > asynchronously calls "getNameNodeProxy" to request "ob2" for the > "rollEditLog" operation, since the "ob2" machine is shut down, it cannot > establish a network connection, this results in a timeout during the socket > connection phase (here set timeout is 90 seconds). > {quote} > 2023-11-03 10:27:41,734 INFO ha.EditLogTailer > (EditLogTailer.java:triggerActiveLogRoll(465)) [Edit log tailer] - Triggering > log roll on remote NameNode > 2023-11-03 10:28:41,734 WARN ha.EditLogTailer > (EditLogTailer.java:triggerActiveLogRoll(478)) [Edit log tailer] - Unable to > finish rolling edits in 60000 ms > {quote} > * As the asynchronous call for "Triggering log roll" has a waiting time of 60 > seconds, it triggers a timeout and initiates a "cancel" operation, causing > the executing thread to receive an "Interrupted" signal and will throw > "java.io.InterruptedIOException". > {quote} > 2023-11-03 10:28:41,753 WARN ipc.Client > (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted > while trying for connection > 2023-11-03 10:28:41,972 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) > [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo > [nnId=ob2, ipcAddress=xxx, httpAddress=http://xxx], try next. > java.io.InterruptedIOException: DestHost:destPort xxx , LocalHost:localPort > xxx. Failed on local exception: java.io.InterruptedIOException: Interrupted > while waiting for IO on channel > java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total > timeout mills is 90000, 30038 millis timeout left. > at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown > Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583) > at org.apache.hadoop.ipc.Client.call(Client.java:1511) > at org.apache.hadoop.ipc.Client.call(Client.java:1402) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141) > at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.InterruptedIOException: Interrupted while waiting for IO > on channel java.nio.channels.SocketChannel[connection-pending > remote=xxx:8040]. Total timeout mills is 90000, 30038 millis timeout left. > at > org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653) > {quote} > * Since current logic not to handle interrupted signal and hasn't reached > the maximum retry limit, the overall execution process doesn't exit and it > continues to attempt to call the "rollEditLog" on the next NameNode in the > list. However when a socket connection is established, it throws a > "java.nio.channels.ClosedByInterruptException" exception due to the thread > being in an "Interrupted" state. this cycle repeats until it reaches 9 retry > limit (3*3) and then exits. > {quote} > 2023-11-03 10:28:41,991 WARN ipc.Client > (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted > while trying for connection > 2023-11-03 10:28:41,991 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) > [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo > [nnId=nn1, ipcAddress=xxx:8040, httpAddress=http://xxx:50070], try next. > java.nio.channels.ClosedByInterruptException > at > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) > at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:656) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653) > at org.apache.hadoop.ipc.Client.call(Client.java:1449) > at org.apache.hadoop.ipc.Client.call(Client.java:1402) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141) > at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2023-11-03 10:28:42,095 WARN ipc.Client > (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted > while trying for connection > 2023-11-03 10:28:42,095 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) > [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo > [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next. > java.nio.channels.ClosedByInterruptException > 2023-11-03 10:28:42,104 WARN ipc.Client > (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted > while trying for connection > 2023-11-03 10:28:42,104 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) > [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo > [nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next. > java.nio.channels.ClosedByInterruptException > 2023-11-03 10:28:42,118 WARN ipc.Client > (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted > while trying for connection > 2023-11-03 10:28:42,118 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) > [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo > [nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next. > 2023-11-03 10:28:42,119 WARN ipc.Client > (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted > while trying for connection > 2023-11-03 10:28:42,119 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) > [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo > [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next. > java.nio.channels.ClosedByInterruptException > 2023-11-03 10:28:42,148 WARN ipc.Client > (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted > while trying for connection > 2023-11-03 10:28:42,148 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) > [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo > [nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next. > java.nio.channels.ClosedByInterruptException > 2023-11-03 10:28:42,148 WARN ipc.Client > (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted > while trying for connection > 2023-11-03 10:28:42,148 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) > [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo > [nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next. > java.nio.channels.ClosedByInterruptException > 2023-11-03 10:28:42,149 WARN ipc.Client > (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted > while trying for connection > 2023-11-03 10:28:42,149 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) > [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo > [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next. > java.nio.channels.ClosedByInterruptException > {quote} > * In the next cycle of "Triggering log roll," it continues to traverse the > NameNode list and proceeds to "ob2," causing the same issue to recur. This > eventually results in the NameNode being unable to successfully complete the > "Triggering log roll" operation. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org