[ https://issues.apache.org/jira/browse/KUDU-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868127#comment-17868127 ]
ASF subversion and git services commented on KUDU-3349:
-------------------------------------------------------

Commit e44e0d4892b0e2469a18aefb78062f5aa2e1799c in kudu's branch refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=e44e0d489 ]

[client] add ScanTokenStaleRaftMembershipTest

This patch adds a new test scenario TabletLeaderChange into the newly
added ScanTokenStaleRaftMembershipTest fixture.

The motivation for this patch was a request to clarify on the Kudu C++
client's behavior in particular scenarios, which on itself was in the
context of a follow-up to KUDU-3349.

Change-Id: I6ce3d549d4ab2502c58deae1250b49ba16bbc914
Reviewed-on: http://gerrit.cloudera.org:8080/21580
Reviewed-by: Ashwani Raina <ara...@cloudera.com>
Reviewed-by: Abhishek Chennaka <achenn...@cloudera.com>
Tested-by: Alexey Serbin <ale...@apache.org>


> Kudu java client failed to demote leader and caused a lot of deleting rows
> timeout
> ----------------------------------------------------------------------------------
>
>                 Key: KUDU-3349
>                 URL: https://issues.apache.org/jira/browse/KUDU-3349
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Redriver
>            Priority: Major
>             Fix For: 1.16.0
>
>
> While deleting rows through Spark, I found a lot of PendingErrors related to
> timeouts. The deletion takes a very long time and sometimes eventually
> fails.
> {code:java}
> java.lang.RuntimeException: PendingErrors overflowed.
Failed to write at
> least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before
> timeout: Batch{operations=100, tablet="b037e2e266b44c4c95da4065a8d5b719"
> [0x00000006, 0x00000007), ignoredErrors=[], rpc=KuduRpc(method=Write,
> tablet=b037e2e266b44c4c95da4065a8d5b719, attempt=23,
> TimeoutTracker(timeout=30000, elapsed=26852), Traces: [0ms] sending RPC to
> server <ByteString@1d59316e size=32
> contents="c51a7275257240b7a8c7e99d0895ae89">, [2ms] delaying RPC due to:
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
> config: current role FOLLOWER (error 0), [2ms] received response from server
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">:
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
> config: current role FOLLOWER (error 0), [10ms] sending RPC to server
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">,
> [12ms] delaying RPC due to: Illegal state: replica
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
> FOLLOWER (error 0), [12ms] received response from server <ByteString@1d59316e
> size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
> FOLLOWER (error 0), [30ms] sending RPC to server <ByteString@1d59316e size=32
> contents="c51a7275257240b7a8c7e99d0895ae89">, [32ms] delaying RPC due to:
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
> config: current role FOLLOWER (error 0), [32ms] received response from server
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">:
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
> config: current role FOLLOWER (error 0), [51ms] sending RPC to server
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">,
> [52ms] delaying RPC due to: Illegal state: replica
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
> FOLLOWER (error 0), [52ms] received response from server <ByteString@1d59316e
> size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
> FOLLOWER (error 0), [70ms] sending RPC to server <ByteString@1d59316e size=32
> contents="c51a7275257240b7a8c7e99d0895ae89">, [72ms] delaying RPC due to:
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
> config: current role FOLLOWER (error 0), [72ms] received response from server
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">:
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
> config: current role FOLLOWER (error 0), [90ms] sending RPC to server
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">,
> [92ms] delaying RPC due to: Illegal state: replica
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
> FOLLOWER (error 0), [92ms] received response from server <ByteString@1d59316e
> size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
> FOLLOWER (error 0), [170ms] sending RPC to server <ByteString@1d59316e
> size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [172ms] delaying RPC
> due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader
> of this config: current role FOLLOWER (error 0), [172ms] received response
> from server <ByteString@1d59316e size=32
> contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
> FOLLOWER (error 0), [190ms] sending RPC to server <ByteString@1d59316e
> size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [192ms] delaying RPC
> due to: Illegal state: replica
c51a7275257240b7a8c7e99d0895ae89 is not leader
> of this config: current role FOLLOWER (error 0), [192ms] received response
> from server <ByteString@1d59316e size=32
> contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
> FOLLOWER (erro
> ...
> {code}
> I believe I have found the root cause.
> After enabling DEBUG-level logging, I noticed something interesting: the two
> UUIDs are in fact the same, but kudu-client treats them as different. See the
> log line below.
> {code:java}
> 22/01/19 23:34:34,621 DEBUG [kudu-nio-3] client.RemoteTablet:170 :
> <ByteString@6dffd497 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437">
> wasn't the leader for 44fa35c99e7042329bbfa0268c1cd4de, current leader is
> <ByteString@5aadafd0 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437">
> {code}
> This mismatch caused kudu-client to fail to demote the leader:
> {code:java}
>   void demoteLeader(String uuid) {
>     synchronized (tabletServers) {
>       if (leaderUuid == null) {
>         LOG.debug("{} couldn't be demoted as the leader for {}, there is no known leader",
>             uuid, getTabletId());
>         return;
>       }
>       if (leaderUuid.equals(uuid)) {
>         leaderUuid = null;
>         LOG.debug("{} was demoted as the leader for {}", uuid, getTabletId());
>       } else {
>         LOG.debug("{} wasn't the leader for {}, current leader is {}", uuid,
>             getTabletId(), leaderUuid);
>       }
>     }
>   }
> {code}
> After some debugging, I found that the tserver's UUID string is generated
> by "serverMetadataPB.getUuid().toString()":
> [https://github.com/apache/kudu/blob/master/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L246]
> The correct way to get the UUID is "serverMetadataPB.getUuid().toStringUtf8()".
> With this bug fixed, deleting rows is faster than before because the
> client no longer sends writes to the wrong leader.
> I'll submit a patch for this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
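
The root cause described in the quoted issue can be illustrated with a small, self-contained sketch. It does not use the real protobuf library: FakeByteString is a hypothetical stand-in whose toString() only imitates the diagnostic format visible in the DEBUG log line above (an identity hash plus size and contents), and the helper names sameViaToString and sameViaUtf8 are likewise made up for illustration. Under those assumptions, it shows why two byte strings carrying the same UUID would not generally compare equal through toString(), while their decoded UTF-8 contents do.

```java
import java.nio.charset.StandardCharsets;

public class UuidCompareSketch {

    // Hypothetical stand-in for a protobuf-style byte string. NOT the real
    // class; it only mimics the log format quoted in the issue.
    static final class FakeByteString {
        private final byte[] bytes;

        FakeByteString(String text) {
            this.bytes = text.getBytes(StandardCharsets.UTF_8);
        }

        // Diagnostic representation: embeds the object's identity hash, so it
        // is tied to the instance rather than to the contents alone.
        @Override
        public String toString() {
            return "<ByteString@" + Integer.toHexString(System.identityHashCode(this))
                + " size=" + bytes.length
                + " contents=\"" + toStringUtf8() + "\">";
        }

        // Decodes the raw bytes; depends only on the contents.
        String toStringUtf8() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Comparison in the style the issue identifies as the bug.
    static boolean sameViaToString(FakeByteString a, FakeByteString b) {
        return a.toString().equals(b.toString());
    }

    // Comparison in the style the issue proposes as the fix.
    static boolean sameViaUtf8(FakeByteString a, FakeByteString b) {
        return a.toStringUtf8().equals(b.toStringUtf8());
    }

    public static void main(String[] args) {
        String uuid = "fc07f681d3ea4bab9bc5ec8090ab9437";
        FakeByteString first = new FakeByteString(uuid);
        FakeByteString second = new FakeByteString(uuid);

        System.out.println(first);
        System.out.println(second);
        // Typically false, because the two identity hashes usually differ.
        System.out.println("equal via toString():     " + sameViaToString(first, second));
        System.out.println("equal via toStringUtf8(): " + sameViaUtf8(first, second));
    }
}
```

In demoteLeader(), a leaderUuid.equals(uuid) check over strings built the first way would fall through to the "wasn't the leader" branch even when both sides name the same tablet server, which is consistent with the DEBUG line quoted in the issue.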