[ 
https://issues.apache.org/jira/browse/KUDU-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405252#comment-16405252
 ] 

Mike Percy commented on KUDU-2152:
----------------------------------

I think this may share a root cause with KUDU-2293: since the disk failure 
work, certain faults are no longer fatal, so the error handling in the tablet 
copy client cleanup code isn't as robust as it should be.
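
To make the suspected weakness concrete, here is a minimal sketch (not Kudu's 
actual code) of cleanup that tolerates non-fatal errors but still records a 
terminal state. The function name, the set_state callback, and the choice of 
a tombstoned end state are all assumptions made for illustration; only the 
TABLET_DATA_COPYING name comes from the report.

```python
import os
from enum import Enum


class DataState(Enum):
    COPYING = "TABLET_DATA_COPYING"        # state name seen in the report
    TOMBSTONED = "TABLET_DATA_TOMBSTONED"  # assumed end state for this sketch


def cleanup_failed_copy(tmp_paths, set_state):
    """Remove leftover temp files, then always record a terminal state.

    Hypothetical helper. Returns the (path, error) pairs that could not
    be removed, so the caller can log them.
    """
    errors = []
    try:
        for path in tmp_paths:
            try:
                os.remove(path)
            except FileNotFoundError:
                # Already gone: nothing to clean up, not a failure.
                continue
            except OSError as e:
                # Non-fatal: note it and keep cleaning up.
                errors.append((path, e))
    finally:
        # Runs even if cleanup raised unexpectedly, so the replica is
        # never left behind in the COPYING state.
        set_state(DataState.TOMBSTONED)
    return errors
```

The point of the sketch is the finally clause: once faults stop being fatal, 
every exit path from cleanup has to leave the replica in a state the leader 
can act on.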

> Tablet stuck under-replicated after some kind of tablet copy issue
> ------------------------------------------------------------------
>
>                 Key: KUDU-2152
>                 URL: https://issues.apache.org/jira/browse/KUDU-2152
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.5.0
>            Reporter: Todd Lipcon
>            Assignee: Andrew Wong
>            Priority: Critical
>         Attachments: raft_consensus_stress-itest.txt.gz
>
>
> I was stress testing with the following setup:
> - 8 servers (n1-standard-4 GCE boxes)
> - created a bunch of 100-tablet tables using loadgen until I had ~2500 
> replicas on each server
> - mounted another server using sshfs and put cmeta on that mount point (to 
> make cmeta writes slower)
> - stress -c4 on all machines
> - shut down a server, wait for re-replication (green ksck), restart the 
> server, rinse and repeat
> Eventually I got a stuck tablet. ksck reports:
> {code}
> Tablet 271df8901d98442cb478593babd8a609 of table 
> 'loadgen_auto_8e32cb07eb83458da4ec4d228bcb0f5a' is under-replicated: 1 
> replica(s) not RUNNING
>   20d4d86f182043398594b67492d13fdc (kudu513-8.gce.cloudera.com:7050): RUNNING 
> [LEADER]
>   c2ea8f22f4034bcc97e26c9236811960 (kudu513-1.gce.cloudera.com:7050): bad 
> state
>     State:       STOPPED
>     Data state:  TABLET_DATA_COPYING
>     Last status: Deleted tablet blocks from disk
>   cd0997b908ad41839f56a1b61210f2d4 (kudu513-3.gce.cloudera.com:7050): RUNNING
> 1 replicas' active configs differ from the master's.
>   All the peers reported by the master and tablet servers are:
>   A = 20d4d86f182043398594b67492d13fdc
>   D = 471027436ee8405ab7cdf8d22407696b
>   B = c2ea8f22f4034bcc97e26c9236811960
>   C = cd0997b908ad41839f56a1b61210f2d4
> The consensus matrix is:
>  Config source |      Voters      | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A*      B   C    |              |              | Yes
>  A             | A*      B   C    | 11           | 29           | Yes
>  B             |     D   B   C    | 9            | 23           | Yes
>  C             | A*      B   C    | 11           | 29           | Yes
> {code}
> The leader ("A" above) just keeps reporting that it's failing to send 
> requests to "B" because it's getting TABLET_NOT_RUNNING. So it never evicts 
> it (the leader treats TABLET_NOT_RUNNING as a temporary condition assuming 
> that it actually means BOOTSTRAPPING).
> The last lines in "B"'s log were:
> {code}
> I0920 16:41:48.556422  3808 tablet_copy_client.cc:209] T 
> 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: tablet 
> copy: Beginning tablet copy session from remote peer at address 
> kudu513-8.gce.cloudera.com:7050
> I0920 16:41:48.562335  3808 ts_tablet_manager.cc:1118] T 
> 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: Deleting 
> tablet data with delete state TABLET_DATA_COPYING
> W0920 16:41:48.578610  3808 env_util.cc:277] Failed to determine if path is a 
> directory: 
> /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: 
> Not found: 
> /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: 
> No such file or directory (error 2)
> {code}
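
The report notes that the leader treats TABLET_NOT_RUNNING as temporary 
indefinitely and so never evicts "B". One possible mitigation, sketched below 
under stated assumptions, is to bound how long that error is tolerated. This 
is not Kudu's implementation: the FollowerHealth class, its methods, and the 
grace period are hypothetical, and only the error name comes from the report.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Error name taken from the report; everything else here is illustrative.
TABLET_NOT_RUNNING = "TABLET_NOT_RUNNING"


@dataclass
class FollowerHealth:
    """Tracks how long each follower has been returning TABLET_NOT_RUNNING."""

    grace_period_s: float
    first_not_running: Dict[str, float] = field(default_factory=dict)

    def record_response(self, peer: str, error: Optional[str], now: float) -> None:
        if error == TABLET_NOT_RUNNING:
            # Remember only the first time the peer entered this state.
            self.first_not_running.setdefault(peer, now)
        else:
            # Any other response (success or a different error) resets the clock.
            self.first_not_running.pop(peer, None)

    def should_evict(self, peer: str, now: float) -> bool:
        # Evict only once the peer has been NOT_RUNNING for the whole grace period.
        start = self.first_not_running.get(peer)
        return start is not None and now - start >= self.grace_period_s


# Usage: a peer stuck in TABLET_NOT_RUNNING becomes evictable after 300 seconds.
health = FollowerHealth(grace_period_s=300.0)
health.record_response("B", TABLET_NOT_RUNNING, now=0.0)
print(health.should_evict("B", now=100.0))  # still within the grace period
print(health.should_evict("B", now=300.0))  # grace period exhausted
```

A time bound keeps the benign case (a replica that is still bootstrapping) 
working while ensuring a replica stuck in STOPPED / TABLET_DATA_COPYING is 
eventually replaced.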



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
