Alexey Serbin created KUDU-3576:
-----------------------------------

             Summary: An NPE thrown in Connection.exceptionCaught() makes the 
connection to corresponding tablet server unusable
                 Key: KUDU-3576
                 URL: https://issues.apache.org/jira/browse/KUDU-3576
             Project: Kudu
          Issue Type: Bug
          Components: client, java
    Affects Versions: 1.17.0, 1.16.0, 1.15.0, 1.14.0, 1.13.0, 1.12.0
            Reporter: Alexey Serbin
             Fix For: 1.18.0, 1.17.1


If a Kudu Java client application keeps a connection to a tablet server open 
and the tablet server is killed/restarted or a network error happens on the 
connection, the client application might end up in a state when it cannot 
communicate with the tablet server even after the tablet server is up and 
running again.  If the application tries to write to any tablet replica that is 
hosted at the tablet server, all such requests will timeout on the very first 
attempt, and the state of the connection to the server remains in a limbo since 
then.  The only way to get out of the trouble is to recreate the affected Java 
Kudu client instance, e.g., by restarting the application.

More details are below.

Once the NPE is thrown by {{Connection.exceptionCaught()}} upon an attempt to 
access null {{ctx}} variable of the {{ChannelHandlerContext}} type, all the 
subsequent attempts to send Write RPC to any tablet replica hosted at the 
tablet server end up with a timeout on a very first attempt (i.e. there are no 
retries):
{noformat}
java.lang.RuntimeException: PendingErrors overflowed. Failed to write at least 
1000 rows to Kudu; Sample errors: Timed out: cannot complete before timeout: 
Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" 
[0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], 
rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, 
TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), 
Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
 Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot 
complete before timeout: Batch{operations=1000, 
tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 
0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, 
tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, 
TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), 
Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
 Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot 
complete before timeout: Batch{operations=1000, 
tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 
0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, 
tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, 
TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), 
Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
 Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot 
complete before timeout: Batch{operations=1000, 
tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 
0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, 
tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, 
TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), 
Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
 Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot 
complete before timeout: Batch{operations=1000, 
tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 
0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, 
tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, 
TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), 
Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
 Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
{noformat}

The root cause of the problem manifests itself as an NPE in 
{{Connection.exceptionCaught()}} with a stack trace like below:
{noformat}
24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by 
org.apache.kudu.client.Connection$1.operationComplete()
 java.lang.NullPointerException
  at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434)
  at org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746)
  at 
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
  at 
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
  at 
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
  at 
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
  at 
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
  at 
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
  at 
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
  at 
org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
  at 
org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
  at 
org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
  at 
org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
  at 
org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
  at 
org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
  at 
org.apache.kudu.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
  at 
org.apache.kudu.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
{noformat}

The issue was introduced with KUDU-1438 in changelist 
[57dda5d48|https://github.com/apache/kudu/commit/57dda5d4868d29f68de4aa0ac516ca390333e6be].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to