[
https://issues.apache.org/jira/browse/KUDU-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174035#comment-16174035
]
Todd Lipcon commented on KUDU-2152:
-----------------------------------
Found a log on the other side of this tablet copy:
{code}
W0920 16:41:48.582257   450 consensus_peers.cc:396] T 271df8901d98442cb478593babd8a609 P 20d4d86f182043398594b67492d13fdc -> Peer c2ea8f22f4034bcc97e26c9236811960 (kudu513-1.gce.cloudera.com:7050): Unable to begin Tablet Copy on peer: error { code: UNKNOWN_ERROR status { code: NOT_FOUND message: "Could not replace superblock with COPYING data state: Failed to write tablet metadata 271df8901d98442cb478593babd8a609: Failed to rename tmp file to /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609: /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: No such file or directory" posix_code: 2 } }
{code}
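The error message describes a write-to-tmp-then-rename sequence for the tablet superblock: the metadata is written to a ".kudutmp" file, which is then renamed into place. A minimal sketch of that pattern (not Kudu's actual implementation; the function and parameter names here are hypothetical) shows how a tmp file vanishing between the write and the rename would surface as exactly this NOT_FOUND / posix_code 2 failure:

```cpp
#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical sketch of the write-tmp-then-rename pattern from the
// error above (not Kudu's real code). The superblock is first written
// to tmp_path, then atomically renamed to final_path. If the tmp file
// has disappeared by the time rename(2) runs (e.g. a concurrent tablet
// delete removed it), rename fails and errno is ENOENT (2), matching
// the "No such file or directory ... posix_code: 2" in the log.
bool RenameTmpMetadataIntoPlace(const std::string& tmp_path,
                                const std::string& final_path,
                                int* posix_code) {
  if (std::rename(tmp_path.c_str(), final_path.c_str()) != 0) {
    *posix_code = errno;  // ENOENT == 2 when the tmp file is gone
    return false;
  }
  *posix_code = 0;
  return true;
}
```

Renaming a tmp path that no longer exists reproduces the failure mode: the call returns false with posix_code ENOENT, which is what the source peer reported back as UNKNOWN_ERROR/NOT_FOUND.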
> Tablet stuck under-replicated after some kind of tablet copy issue
> ------------------------------------------------------------------
>
> Key: KUDU-2152
> URL: https://issues.apache.org/jira/browse/KUDU-2152
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 1.5.0
> Reporter: Todd Lipcon
> Priority: Critical
>
> I was stress testing with the following setup:
> - 8 servers (n1-standard-4 GCE boxes)
> - created a bunch of 100-tablet tables using loadgen until I had ~2500 replicas on each server
> - mounted another server using sshfs and put cmeta on that mount point (to make cmeta writes slower)
> - ran stress -c4 on all machines
> - shut down a server, waited for re-replication (green ksck), restarted the server, rinse and repeat
> Eventually I got a stuck tablet. ksck reports:
> {code}
> Tablet 271df8901d98442cb478593babd8a609 of table 'loadgen_auto_8e32cb07eb83458da4ec4d228bcb0f5a' is under-replicated: 1 replica(s) not RUNNING
>   20d4d86f182043398594b67492d13fdc (kudu513-8.gce.cloudera.com:7050): RUNNING [LEADER]
>   c2ea8f22f4034bcc97e26c9236811960 (kudu513-1.gce.cloudera.com:7050): bad state
>     State:       STOPPED
>     Data state:  TABLET_DATA_COPYING
>     Last status: Deleted tablet blocks from disk
>   cd0997b908ad41839f56a1b61210f2d4 (kudu513-3.gce.cloudera.com:7050): RUNNING
> 1 replicas' active configs differ from the master's.
> All the peers reported by the master and tablet servers are:
>   A = 20d4d86f182043398594b67492d13fdc
>   D = 471027436ee8405ab7cdf8d22407696b
>   B = c2ea8f22f4034bcc97e26c9236811960
>   C = cd0997b908ad41839f56a1b61210f2d4
> The consensus matrix is:
>  Config source |  Voters  | Current term | Config index | Committed?
> ---------------+----------+--------------+--------------+------------
>  master        | A*  B  C |              |              | Yes
>  A             | A*  B  C | 11           | 29           | Yes
>  B             | D   B  C | 9            | 23           | Yes
>  C             | A*  B  C | 11           | 29           | Yes
> {code}
> The leader ("A" above) just keeps reporting that it's failing to send requests to "B" because it's getting TABLET_NOT_RUNNING, so it never evicts "B": the leader treats TABLET_NOT_RUNNING as a temporary condition, assuming it actually means the replica is BOOTSTRAPPING.
> "B"'s last bit in the logs were:
> {code}
> I0920 16:41:48.556422  3808 tablet_copy_client.cc:209] T 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: tablet copy: Beginning tablet copy session from remote peer at address kudu513-8.gce.cloudera.com:7050
> I0920 16:41:48.562335  3808 ts_tablet_manager.cc:1118] T 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: Deleting tablet data with delete state TABLET_DATA_COPYING
> W0920 16:41:48.578610  3808 env_util.cc:277] Failed to determine if path is a directory: /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: Not found: /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: No such file or directory (error 2)
> {code}
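The never-evict behavior described in the issue can be sketched as follows. This is a hypothetical illustration, not Kudu's actual eviction code; the enum, function, and parameter names are invented for the example:

```cpp
#include <cstdint>

// Hypothetical sketch of the leader's eviction decision (not Kudu's
// real code). TABLET_NOT_RUNNING is treated as transient -- assumed to
// mean the follower is BOOTSTRAPPING -- so a peer stuck permanently in
// that state is retried forever and never evicted, leaving the tablet
// under-replicated, as in this bug.
enum class PeerStatus { OK, TABLET_NOT_RUNNING, TABLET_NOT_FOUND };

bool ShouldEvictPeer(PeerStatus last_status, int64_t secs_unresponsive,
                     int64_t follower_timeout_secs) {
  // Assumed-transient error: never evict, no matter how long the peer
  // has been failing. This is the stuck-tablet scenario above.
  if (last_status == PeerStatus::TABLET_NOT_RUNNING) {
    return false;
  }
  // Other failures are evictable once the follower times out.
  return secs_unresponsive > follower_timeout_secs;
}
```

Under this model, replica "B" (STOPPED with data state TABLET_DATA_COPYING, reporting TABLET_NOT_RUNNING) is never evicted regardless of how long it has been unresponsive, which matches the observed behavior.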
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)