[ https://issues.apache.org/jira/browse/KUDU-3576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Serbin resolved KUDU-3576.
---------------------------------
    Resolution: Fixed

> An NPE thrown in Connection.exceptionCaught() makes the connection to
> corresponding tablet server unusable
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3576
>                 URL: https://issues.apache.org/jira/browse/KUDU-3576
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, java
>    Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0
>            Reporter: Alexey Serbin
>            Priority: Major
>             Fix For: 1.18.0, 1.17.1
>
>
> If a Kudu Java client application keeps a connection to a tablet server open
> and the tablet server is killed/restarted or a network error happens on the
> connection, the client application might end up in a state where it cannot
> communicate with the tablet server even after the tablet server is up and
> running again. If the application tries to write to any tablet replica that
> is hosted at the tablet server, all such requests will time out on the very
> first attempt, and the connection to the server remains in limbo from then
> on. The only way to recover is to recreate the affected Java Kudu client
> instance, e.g., by restarting the application.
> More details are below.
> Once the NPE is thrown by {{Connection.exceptionCaught()}} upon an attempt to
> access the null {{ctx}} variable of the {{ChannelHandlerContext}} type, all
> subsequent attempts to send a Write RPC to any tablet replica hosted at the
> tablet server end up with a timeout on the very first attempt (i.e. there are
> no retries):
> {noformat}
> java.lang.RuntimeException: PendingErrors overflowed.
> Failed to write at least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before
> timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395"
> [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[],
> rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
> Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
> complete before timeout: Batch{operations=1000,
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
> Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
> complete before timeout: Batch{operations=1000,
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
> Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
> complete before timeout: Batch{operations=1000,
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
> Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
> complete before timeout: Batch{operations=1000,
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
> Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
> {noformat}
> The root cause of the problem manifests itself as an NPE in
> {{Connection.exceptionCaught()}} with a stack trace like the one below:
> {noformat}
> 24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by org.apache.kudu.client.Connection$1.operationComplete()
> java.lang.NullPointerException
>     at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434)
>     at org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746)
>     at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
>     at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
>     at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
>     at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
>     at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
>     at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
>     at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>     at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
>     at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
>     at
>     org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
>     at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
>     at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
>     at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
>     at org.apache.kudu.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
>     at org.apache.kudu.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The issue was introduced with KUDU-1438 in changelist
> [57dda5d48|https://github.com/apache/kudu/commit/57dda5d4868d29f68de4aa0ac516ca390333e6be].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
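
For readers who want to see the failure mode in isolation, here is a minimal, self-contained sketch. It is not Kudu's actual code and not the actual fix: ConnectionSketch, Context, State, notifyListener() and both handler variants are hypothetical names made up for illustration. The sketch assumes that the error handler receives a null context when the connect attempt fails, and that the code notifying the listener logs the exception instead of rethrowing it, which is consistent with the WARN line in the stack trace above.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical stand-in for a client connection; NOT Kudu's actual Connection class. */
public class ConnectionSketch {
  public enum State { CONNECTING, TERMINATED }

  /** Hypothetical stand-in for Netty's ChannelHandlerContext. */
  public static final class Context {
    public void close() { /* a real context would close the channel here */ }
  }

  private State state = State.CONNECTING;
  private final Queue<String> queuedRpcs = new ArrayDeque<>();

  public void enqueue(String rpc) { queuedRpcs.add(rpc); }
  public State state() { return state; }
  public int queued() { return queuedRpcs.size(); }

  /** Unguarded handler: a null ctx throws an NPE before any cleanup runs. */
  public void exceptionCaughtUnguarded(Context ctx, Throwable cause) {
    ctx.close();
    cleanup();
  }

  /** Guarded handler: skips close() when there is no context, but always cleans up. */
  public void exceptionCaughtGuarded(Context ctx, Throwable cause) {
    if (ctx != null) {
      ctx.close();
    }
    cleanup();
  }

  private void cleanup() {
    state = State.TERMINATED;
    queuedRpcs.clear(); // a real client would fail these RPCs so that callers can retry
  }

  /** Mimics a promise listener notification: exceptions from the callback are logged, not rethrown. */
  public static void notifyListener(Runnable listener) {
    try {
      listener.run();
    } catch (RuntimeException e) {
      System.out.println("WARN: an exception was thrown by the listener: " + e);
    }
  }

  public static void main(String[] args) {
    IOException cause = new IOException("connect failed");

    // Unguarded path: the NPE is swallowed and the connection never leaves CONNECTING.
    ConnectionSketch stuck = new ConnectionSketch();
    stuck.enqueue("Write");
    notifyListener(() -> stuck.exceptionCaughtUnguarded(null, cause));
    System.out.println(stuck.state() + ", queued=" + stuck.queued());

    // Guarded path: cleanup runs, so the connection is terminated and the queue is drained.
    ConnectionSketch cleaned = new ConnectionSketch();
    cleaned.enqueue("Write");
    notifyListener(() -> cleaned.exceptionCaughtGuarded(null, cause));
    System.out.println(cleaned.state() + ", queued=" + cleaned.queued());
  }
}
```

In this sketch the unguarded connection stays in CONNECTING with its Write still queued, which mirrors the "limbo" state described in the issue: nothing fails the queued request, so it can only time out. The guarded variant reaches TERMINATED with an empty queue, leaving the caller free to open a fresh connection.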