[ 
https://issues.apache.org/jira/browse/KUDU-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868127#comment-17868127
 ] 

ASF subversion and git services commented on KUDU-3349:
-------------------------------------------------------

Commit e44e0d4892b0e2469a18aefb78062f5aa2e1799c in kudu's branch 
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=e44e0d489 ]

[client] add ScanTokenStaleRaftMembershipTest

This patch adds a new test scenario, TabletLeaderChange, to the newly
added ScanTokenStaleRaftMembershipTest fixture.  The motivation for this
patch was a request to clarify the Kudu C++ client's behavior in
particular scenarios, which itself came up in the context of a follow-up
to KUDU-3349.

Change-Id: I6ce3d549d4ab2502c58deae1250b49ba16bbc914
Reviewed-on: http://gerrit.cloudera.org:8080/21580
Reviewed-by: Ashwani Raina <ara...@cloudera.com>
Reviewed-by: Abhishek Chennaka <achenn...@cloudera.com>
Tested-by: Alexey Serbin <ale...@apache.org>


> Kudu java client failed to demote leader and caused a lot of deleting rows 
> timeout
> ----------------------------------------------------------------------------------
>
>                 Key: KUDU-3349
>                 URL: https://issues.apache.org/jira/browse/KUDU-3349
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Redriver
>            Priority: Major
>             Fix For: 1.16.0
>
>
> While deleting rows through Spark, I found a lot of PendingErrors reporting 
> timeouts; the deletion took a very long time and sometimes eventually 
> failed.
> {code:java}
> java.lang.RuntimeException: PendingErrors overflowed. Failed to write at 
> least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before 
> timeout: Batch{operations=100, tablet="b037e2e266b44c4c95da4065a8d5b719" 
> [0x00000006, 0x00000007), ignoredErrors=[], rpc=KuduRpc(method=Write, 
> tablet=b037e2e266b44c4c95da4065a8d5b719, attempt=23, 
> TimeoutTracker(timeout=30000, elapsed=26852), Traces: [0ms] sending RPC to 
> server <ByteString@1d59316e size=32 
> contents="c51a7275257240b7a8c7e99d0895ae89">, [2ms] delaying RPC due to: 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [2ms] received response from server 
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [10ms] sending RPC to server 
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, 
> [12ms] delaying RPC due to: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [12ms] received response from server <ByteString@1d59316e 
> size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [30ms] sending RPC to server <ByteString@1d59316e size=32 
> contents="c51a7275257240b7a8c7e99d0895ae89">, [32ms] delaying RPC due to: 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [32ms] received response from server 
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [51ms] sending RPC to server 
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, 
> [52ms] delaying RPC due to: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [52ms] received response from server <ByteString@1d59316e 
> size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [70ms] sending RPC to server <ByteString@1d59316e size=32 
> contents="c51a7275257240b7a8c7e99d0895ae89">, [72ms] delaying RPC due to: 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [72ms] received response from server 
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [90ms] sending RPC to server 
> <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, 
> [92ms] delaying RPC due to: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [92ms] received response from server <ByteString@1d59316e 
> size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [170ms] sending RPC to server <ByteString@1d59316e 
> size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [172ms] delaying RPC 
> due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader 
> of this config: current role FOLLOWER (error 0), [172ms] received response 
> from server <ByteString@1d59316e size=32 
> contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [190ms] sending RPC to server <ByteString@1d59316e 
> size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [192ms] delaying RPC 
> due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader 
> of this config: current role FOLLOWER (error 0), [192ms] received response 
> from server <ByteString@1d59316e size=32 
> contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (erro
> ...
> {code}
> I think I found the root cause.
> After enabling DEBUG level logging, I found something interesting: the two 
> UUIDs are in fact the same, but kudu-client treats them as different. See the 
> log line below.
> {code:java}
> 22/01/19 23:34:34,621 DEBUG [kudu-nio-3] client.RemoteTablet:170 : 
> <ByteString@6dffd497 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437"> 
> wasn't the leader for 44fa35c99e7042329bbfa0268c1cd4de, current leader is 
> <ByteString@5aadafd0 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437">
> {code}
> This issue caused the kudu-client to fail to demote the leader.
> {code:java}
>   void demoteLeader(String uuid) {
>     synchronized (tabletServers) {
>       if (leaderUuid == null) {
>         LOG.debug("{} couldn't be demoted as the leader for {}, there is no known leader",
>             uuid, getTabletId());
>         return;
>       }
>       if (leaderUuid.equals(uuid)) {
>         leaderUuid = null;
>         LOG.debug("{} was demoted as the leader for {}", uuid, getTabletId());
>       } else {
>         LOG.debug("{} wasn't the leader for {}, current leader is {}", uuid,
>             getTabletId(), leaderUuid);
>       }
>     }
>   }
> {code}
> After spending some time debugging, I found that the tserver's UUID is 
> generated by "serverMetadataPB.getUuid().toString()": 
> [https://github.com/apache/kudu/blob/master/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L246]
> As the DEBUG output above shows, toString() on a ByteString yields a debug 
> representation (object hash, size, contents) rather than the UUID itself, so 
> the string comparison in demoteLeader() fails even for the same server.
> The correct way to get the UUID is "serverMetadataPB.getUuid().toStringUtf8()".
> After fixing this bug, deletes became faster than before because the 
> client no longer keeps sending writes to the wrong leader.
> I'll submit a patch for this issue.
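
The mismatch described in the report can be illustrated without Kudu or
protobuf on the classpath. The following is a minimal sketch under stated
assumptions: FakeByteString is a hypothetical stand-in that only mimics the
two string conversions visible in the report (a debug-style toString()
carrying an identity hash, and toStringUtf8() returning the payload), and
UuidCompareSketch and sameUuid are invented names. It is not the protobuf
class and not the actual Kudu patch.

```java
import java.nio.charset.StandardCharsets;

public class UuidCompareSketch {

    // Hypothetical stand-in for a protobuf-style byte string; only the two
    // conversions relevant to the reported bug are modeled here.
    public static final class FakeByteString {
        private final byte[] bytes;

        public FakeByteString(String s) {
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        // Debug form: embeds this object's identity hash, so two objects
        // holding the same payload normally render as different strings.
        @Override
        public String toString() {
            return String.format("<ByteString@%s size=%d contents=\"%s\">",
                Integer.toHexString(System.identityHashCode(this)),
                bytes.length,
                new String(bytes, StandardCharsets.UTF_8));
        }

        // Payload decoded as UTF-8: depends only on the bytes.
        public String toStringUtf8() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Compares two UUIDs by payload, which is what a leader check needs.
    public static boolean sameUuid(FakeByteString a, FakeByteString b) {
        return a.toStringUtf8().equals(b.toStringUtf8());
    }

    public static void main(String[] args) {
        String uuid = "fc07f681d3ea4bab9bc5ec8090ab9437";
        FakeByteString cachedLeader = new FakeByteString(uuid);
        FakeByteString fromResponse = new FakeByteString(uuid);

        // Comparing debug strings: the hash part usually differs per object.
        System.out.println("debug strings equal: "
            + cachedLeader.toString().equals(fromResponse.toString()));
        // Comparing decoded payloads: equal whenever the bytes are equal.
        System.out.println("payloads equal:      "
            + sameUuid(cachedLeader, fromResponse));
    }
}
```

With the debug-string comparison, a check like the one in demoteLeader()
would take the "wasn't the leader" branch even for the same server, which
matches the DEBUG line quoted in the report.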



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
