When adding nodes via bootstrap to a 27 node 2.0.9 cluster with a
cluster-wide phi_convict_threshold of 12 the nodes fail to bootstrap.
This worked a half dozen times in the past few weeks as we've scaled
this cluster from 21 to 24 and then to 27 nodes. There have been no
configuration or Cassandra version changes since the cluster was
booted several weeks ago. This is adding a single node at a time just
as we've done before.

When using a phi of 16 on the node that is bootstrapping we've had one
failure and one success. Adjusting RING_DELAY from 30 seconds to 45
doesn't appear to help.

We have DEBUG level logs for a successful attempt and a failed attempt
but they seem quite similar in behavior, even bad behavior like ERRORS
reported by MigrationTask.java and FailureDetector.java.


In the failed attempts the bootstrap immediately falls over with
"unable to find ...":
java.lang.IllegalStateException: unable to find sufficient sources for
streaming range (1445230693272554509,1449885986247309687]

The logs clearly show that it knows where those ranges can be found:
DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122)
Bootstrap: range (1445230693272554509,1449885986247309687] exists on
/10.11.12.13
DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122)
Bootstrap: range (1445230693272554509,1449885986247309687] exists on
/10.11.12.14

If you look for the various non-debug log lines for these addresses
you'll see some odd behavior, but you see similar odd behavior in the
logs of the node that is able to successfully bootstrap.
$ grep '10.11.12.13' system.log | grep -v ^DEBUG
 INFO [GossipStage:1] 2014-08-14 18:56:09,870 Gossiper.java (line 910)
Node /10.11.12.13 is now part of the cluster
 INFO [GossipStage:1] 2014-08-14 18:56:15,267 Gossiper.java (line 910)
Node /10.11.12.13 is now part of the cluster
 INFO [HANDSHAKE-/10.11.12.13] 2014-08-14 18:56:45,571
OutboundTcpConnection.java (line 386) Handshaking version with
/10.11.12.13
 INFO [RequestResponseStage:5] 2014-08-14 18:56:45,579 Gossiper.java
(line 876) InetAddress /10.11.12.13 is now UP
 INFO [RequestResponseStage:2] 2014-08-14 18:56:45,579 Gossiper.java
(line 876) InetAddress /10.11.12.13 is now UP
 INFO [RequestResponseStage:7] 2014-08-14 18:56:45,580 Gossiper.java
(line 876) InetAddress /10.11.12.13 is now UP
 INFO [RequestResponseStage:1] 2014-08-14 18:56:45,579 Gossiper.java
(line 876) InetAddress /10.11.12.13 is now UP
 INFO [RequestResponseStage:8] 2014-08-14 18:56:45,579 Gossiper.java
(line 876) InetAddress /10.11.12.13 is now UP

$ grep '10.11.12.14' system.log | grep -v ^DEBUG
 INFO [GossipStage:1] 2014-08-14 18:56:09,865 Gossiper.java (line 910)
Node /10.11.12.14 is now part of the cluster
 INFO [GossipStage:1] 2014-08-14 18:56:15,106 Gossiper.java (line 910)
Node /10.11.12.14 is now part of the cluster

The general order of events, as told by JOINING lines:
$ grep 'JOINING' system.log
 INFO [main] 2014-08-14 18:56:10,638 StorageService.java (line 1009)
JOINING: waiting for ring information
 INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009)
JOINING: schema complete, ready to bootstrap
 INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009)
JOINING: waiting for pending range calculation
 INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009)
JOINING: calculation complete, ready to bootstrap
 INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009)
JOINING: getting bootstrap token
 INFO [main] 2014-08-14 18:56:14,712 StorageService.java (line 1009)
JOINING: sleeping 30000 ms for pending range setup
 INFO [main] 2014-08-14 18:56:44,713 StorageService.java (line 1009)
JOINING: Starting to bootstrap...


Thanks in advance for any suggestions.


Peter

Reply via email to