Alexey Serbin created KUDU-3576: ----------------------------------- Summary: An NPE thrown in Connection.exceptionCaught() makes the connection to corresponding tablet server unusable Key: KUDU-3576 URL: https://issues.apache.org/jira/browse/KUDU-3576 Project: Kudu Issue Type: Bug Components: client, java Affects Versions: 1.17.0, 1.16.0, 1.15.0, 1.14.0, 1.13.0, 1.12.0 Reporter: Alexey Serbin Fix For: 1.18.0, 1.17.1
If a Kudu Java client application keeps a connection to a tablet server open and the tablet server is killed/restarted or a network error happens on the connection, the client application might end up in a state when it cannot communicate with the tablet server even after the tablet server is up and running again. If the application tries to write to any tablet replica that is hosted at the tablet server, all such requests will timeout on the very first attempt, and the state of the connection to the server remains in a limbo since then. The only way to get out of the trouble is to recreate the affected Java Kudu client instance, e.g., by restarting the application. More details are below. Once the NPE is thrown by {{Connection.exceptionCaught()}} upon an attempt to access null {{ctx}} variable of the {{ChannelHandlerContext}} type, all the subsequent attempts to send Write RPC to any tablet replica hosted at the tablet server end up with a timeout on a very first attempt (i.e. there are no retries): {noformat} java.lang.RuntimeException: PendingErrors overflowed. Failed to write at least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))} {noformat} The root cause of the problem manifests itself as an NPE in {{Connection.exceptionCaught()}} with a stack trace like below: {noformat} 24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by org.apache.kudu.client.Connection$1.operationComplete() java.lang.NullPointerException at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434) at org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) at org.apache.kudu.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995) at org.apache.kudu.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} The issue was introduced with KUDU-1438 in changelist [57dda5d48|https://github.com/apache/kudu/commit/57dda5d4868d29f68de4aa0ac516ca390333e6be]. -- This message was sent by Atlassian Jira (v8.20.10#820010)