[ 
https://issues.apache.org/jira/browse/KUDU-3576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3576.
---------------------------------
    Resolution: Fixed

> An NPE thrown in Connection.exceptionCaught() makes the connection to 
> corresponding tablet server unusable
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3576
>                 URL: https://issues.apache.org/jira/browse/KUDU-3576
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, java
>    Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0
>            Reporter: Alexey Serbin
>            Priority: Major
>             Fix For: 1.18.0, 1.17.1
>
>
> If a Kudu Java client application keeps a connection to a tablet server open 
> and the tablet server is killed or restarted, or a network error happens on 
> the connection, the client application might end up in a state where it 
> cannot communicate with the tablet server even after the tablet server is up 
> and running again.  If the application tries to write to any tablet replica 
> hosted at that tablet server, all such requests time out on the very first 
> attempt, and the connection to the server stays in limbo from then on.  The 
> only way out is to recreate the affected Kudu Java client instance, e.g., by 
> restarting the application.
> More details are below.
> Once the NPE is thrown by {{Connection.exceptionCaught()}} upon an attempt to 
> access the null {{ctx}} variable of the {{ChannelHandlerContext}} type, all 
> subsequent attempts to send a Write RPC to any tablet replica hosted at the 
> tablet server end up with a timeout on the very first attempt (i.e. there are 
> no retries):
> {noformat}
> java.lang.RuntimeException: PendingErrors overflowed. Failed to write at 
> least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before 
> timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" 
> [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], 
> rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, 
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), 
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
>  Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot 
> complete before timeout: Batch{operations=1000, 
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, 
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, 
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), 
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
>  Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot 
> complete before timeout: Batch{operations=1000, 
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, 
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, 
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), 
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
>  Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot 
> complete before timeout: Batch{operations=1000, 
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, 
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, 
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), 
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
>  Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot 
> complete before timeout: Batch{operations=1000, 
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, 
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, 
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), 
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
>  Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
> {noformat}
> The root cause of the problem manifests itself as an NPE in 
> {{Connection.exceptionCaught()}} with a stack trace like the one below:
> {noformat}
> 24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by 
> org.apache.kudu.client.Connection$1.operationComplete()
>  java.lang.NullPointerException
>   at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434)
>   at org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746)
>   at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
>   at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
>   at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
>   at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
>   at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
>   at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
>   at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>   at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
>   at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
>   at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
>   at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
>   at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
>   at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
>   at org.apache.kudu.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
>   at org.apache.kudu.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The issue was introduced with KUDU-1438 in changelist 
> [57dda5d48|https://github.com/apache/kudu/commit/57dda5d4868d29f68de4aa0ac516ca390333e6be].
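
For illustration only, the failure pattern described above can be modeled with a small, self-contained Java sketch. It is not the Kudu code and not the actual fix: SketchContext, SketchConnection and NullCtxSketch are hypothetical names, and the cleanup logic is an assumption about what an exception handler of this kind typically does. The sketch shows how dereferencing a null context before the cleanup leaves a connection stuck with its RPCs still queued, and how a null check would let the cleanup run.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Netty's ChannelHandlerContext (not the real type).
interface SketchContext {
    void close();
}

// Hypothetical, heavily simplified connection model (not Kudu's Connection).
class SketchConnection {
    enum State { CONNECTING, TERMINATED }

    private State state = State.CONNECTING;
    private final List<String> pending = new ArrayList<>();
    private final List<String> failed = new ArrayList<>();

    void enqueue(String rpc) { pending.add(rpc); }
    State state() { return state; }
    int pendingCount() { return pending.size(); }
    List<String> failed() { return failed; }

    // Unguarded variant: dereferences ctx first, so a null ctx throws a
    // NullPointerException and the cleanup below never runs.
    void exceptionCaughtUnguarded(SketchContext ctx, Throwable cause) {
        ctx.close();
        cleanup(cause);
    }

    // Guarded variant: tolerates a null ctx (assumed here to mean that the
    // connection attempt failed before any context existed) and still cleans up.
    void exceptionCaughtGuarded(SketchContext ctx, Throwable cause) {
        if (ctx != null) {
            ctx.close();
        }
        cleanup(cause);
    }

    // Marks the connection terminated and fails every queued RPC so that a
    // caller could retry them on a fresh connection.
    private void cleanup(Throwable cause) {
        state = State.TERMINATED;
        for (String rpc : pending) {
            failed.add(rpc + " failed: " + cause.getMessage());
        }
        pending.clear();
    }
}

class NullCtxSketch {
    public static void main(String[] args) {
        Throwable cause = new IOException("connection refused");

        SketchConnection unguarded = new SketchConnection();
        unguarded.enqueue("Write#1");
        try {
            unguarded.exceptionCaughtUnguarded(null, cause);
        } catch (NullPointerException e) {
            // The connection is left non-terminated with the RPC still queued.
            System.out.println("unguarded: NPE, state=" + unguarded.state()
                + ", pending=" + unguarded.pendingCount());
        }

        SketchConnection guarded = new SketchConnection();
        guarded.enqueue("Write#1");
        guarded.exceptionCaughtGuarded(null, cause);
        System.out.println("guarded: state=" + guarded.state()
            + ", failed=" + guarded.failed());
    }
}
```

In this model the unguarded handler leaves the queued RPC with nothing to fail or retry it, which is one plausible reading of why the writes reported above time out on their first attempt. Whether the real fix takes the same shape should be checked against the KUDU-3576 change itself.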



