[ https://issues.apache.org/jira/browse/HADOOP-19061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xing Lin resolved HADOOP-19061. ------------------------------- Fix Version/s: 3.5.0 Target Version/s: (was: 3.3.9, 3.5.0) Resolution: Fixed > Capture exception in rpcRequestSender.start() in IPC.Connection.run() > --------------------------------------------------------------------- > > Key: HADOOP-19061 > URL: https://issues.apache.org/jira/browse/HADOOP-19061 > Project: Hadoop Common > Issue Type: Bug > Components: ipc > Affects Versions: 3.5.0 > Reporter: Xing Lin > Assignee: Xing Lin > Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > rpcRequestThread.start() can fail due to OOM. This will immediately crash the > Connection thread, without removing itself from the connections pool. Then > for all following getConnection(remoteid), we will get this bad connection > object and all rpc requests will be hanging, because this is a bad connection > object, without threads being properly running (Neither Connection or > Connection.rpcRequestSender thread is running due to OOM.). > In this PR, we moved the rpcRequestThread.start() to be within the > try{}-catch{} block, to capture OOM from rpcRequestThread.start() and proper > cleaning is followed if we hit OOM. > {code:java} > IPC.Connection.run() > @Override > public void run() { > // Don't start the ipc parameter sending thread until we start this > // thread, because the shutdown logic only gets triggered if this > // thread is started. > rpcRequestThread.start(); > if (LOG.isDebugEnabled()) > LOG.debug(getName() + ": starting, having connections " > + connections.size()); > try { > while (waitForWork()) {//wait here for work - read or close connection > receiveRpcResponse(); > } > } catch (Throwable t) { > // This truly is unexpected, since we catch IOException in > receiveResponse > // -- this is only to be really sure that we don't leave a client > hanging > // forever. > LOG.warn("Unexpected error reading responses on connection " + this, > t); > markClosed(new IOException("Error reading responses", t)); > }{code} > Because there is no rpcRequestSender thread consuming the rpcRequestQueue, > all rpc request enqueue operations for this connection will be blocked and > will be hanging at this while loop forever during sendRpcRequest(). > {code:java} > while (!shouldCloseConnection.get()) { > if (rpcRequestQueue.offer(Pair.of(call, buf), 1, TimeUnit.SECONDS)) { > break; > } > }{code} > OOM exception in starting the rpcRequestSender thread. > {code:java} > Exception in thread "IPC Client (1664093259) connection to > nn01.grid.linkedin.com/IP-Address:portNum from kafkaetl" > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1034) > {code} > Multiple threads blocked by queue.offer(). and we don't found any "IPC > Client" or "IPC Parameter Sending Thread" in thread dump. > {code:java} > Thread 2156123: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) > @bci=20, line=215 (Compiled frame) > - > java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(java.util.concurrent.SynchronousQueue$TransferQueue$QNode, > java.lang.Object, boolean, long) @bci=156, line=764 (Compiled frame) > - > java.util.concurrent.SynchronousQueue$TransferQueue.transfer(java.lang.Object, > boolean, long) @bci=148, line=695 (Compiled frame) > - java.util.concurrent.SynchronousQueue.offer(java.lang.Object, long, > java.util.concurrent.TimeUnit) @bci=24, line=895 (Compiled frame) > - > org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(org.apache.hadoop.ipc.Client$Call) > @bci=88, line=1134 (Compiled frame) > - org.apache.hadoop.ipc.Client.call(org.apache.hadoop.ipc.RPC$RpcKind, > org.apache.hadoop.io.Writable, org.apache.hadoop.ipc.Client$ConnectionId, > int, java.util.concurrent.atomic.AtomicBoolean, > org.apache.hadoop.ipc.AlignmentContext) @bci=36, line=1402 (Interpreted frame) > - org.apache.hadoop.ipc.Client.call(org.apache.hadoop.ipc.RPC$RpcKind, > org.apache.hadoop.io.Writable, org.apache.hadoop.ipc.Client$ConnectionId, > java.util.concurrent.atomic.AtomicBoolean, > org.apache.hadoop.ipc.AlignmentContext) @bci=9, line=1349 (Compiled frame) > - org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(java.lang.Object, > java.lang.reflect.Method, java.lang.Object[]) @bci=248, line=230 (Compiled > frame) > - org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(java.lang.Object, > java.lang.reflect.Method, java.lang.Object[]) @bci=4, line=118 (Compiled > frame) > - com.sun.proxy.$Proxy11.getBlockLocations({code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-dev-h...@hadoop.apache.org