When adding nodes via bootstrap to a 27 node 2.0.9 cluster with a cluster-wide phi_convict_threshold of 12 the nodes fail to bootstrap. This worked a half dozen times in the past few weeks as we've scaled this cluster from 21 to 24 and then to 27 nodes. There have been no configuration or Cassandra version changes since the cluster was booted several weeks ago. This is adding a single node at a time just as we've done before.
When using a phi of 16 on the node that is bootstrapping we've had one failure and one success. Adjusting RING_DELAY from 30 seconds to 45 doesn't appear to help. We have DEBUG level logs for a successful attempt and a failed attempt but they seem quite similar in behavior, even bad behavior like ERRORS reported by MigrationTask.java and FailureDetector.java. In the failed attempts the bootstrap immediately falls over with "unable to find ...": java.lang.IllegalStateException: unable to find sufficient sources for streaming range (1445230693272554509,1449885986247309687] The logs clearly show that it knows where those ranges can be found: DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122) Bootstrap: range (1445230693272554509,1449885986247309687] exists on /10.11.12.13 DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122) Bootstrap: range (1445230693272554509,1449885986247309687] exists on /10.11.12.14 If you look for the various non-debug log lines for these addresses you'll see some odd behavior, but you see similar odd behavior in the logs of the node that is able to successfully bootstrap. $ grep '10.11.12.13' system.log | grep -v ^DEBUG INFO [GossipStage:1] 2014-08-14 18:56:09,870 Gossiper.java (line 910) Node /10.11.12.13 is now part of the cluster INFO [GossipStage:1] 2014-08-14 18:56:15,267 Gossiper.java (line 910) Node /10.11.12.13 is now part of the cluster INFO [HANDSHAKE-/10.11.12.13] 2014-08-14 18:56:45,571 OutboundTcpConnection.java (line 386) Handshaking version with /10.11.12.13 INFO [RequestResponseStage:5] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:2] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:7] 2014-08-14 18:56:45,580 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:1] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:8] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP $ grep '10.11.12.14' system.log | grep -v ^DEBUG INFO [GossipStage:1] 2014-08-14 18:56:09,865 Gossiper.java (line 910) Node /10.11.12.14 is now part of the cluster INFO [GossipStage:1] 2014-08-14 18:56:15,106 Gossiper.java (line 910) Node /10.11.12.14 is now part of the cluster The general order of events, as told by JOINING lines: $ grep 'JOINING' system.log INFO [main] 2014-08-14 18:56:10,638 StorageService.java (line 1009) JOINING: waiting for ring information INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009) JOINING: schema complete, ready to bootstrap INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009) JOINING: waiting for pending range calculation INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009) JOINING: calculation complete, ready to bootstrap INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009) JOINING: getting bootstrap token INFO [main] 2014-08-14 18:56:14,712 StorageService.java (line 1009) JOINING: sleeping 30000 ms for pending range setup INFO [main] 2014-08-14 18:56:44,713 StorageService.java (line 1009) JOINING: Starting to bootstrap... Thanks in advance for any suggestions. Peter