I'll add that while we don't see either of the nodes with the range that failed marked as down one of them *does* appear to have a late arrival of the cluster of UP messages that happens just after the bootstrap fails:
$ egrep 'JOIN|Illegal|10.11.12.14|10.11.12.13' system.log | grep -v ^DEBUG INFO [GossipStage:1] 2014-08-14 18:56:09,865 Gossiper.java (line 910) Node /10.11.12.14 is now part of the cluster INFO [GossipStage:1] 2014-08-14 18:56:09,870 Gossiper.java (line 910) Node /10.11.12.13 is now part of the cluster INFO [main] 2014-08-14 18:56:10,638 StorageService.java (line 1009) JOINING: waiting for ring information INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009) JOINING: schema complete, ready to bootstrap INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009) JOINING: waiting for pending range calculation INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009) JOINING: calculation complete, ready to bootstrap INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009) JOINING: getting bootstrap token INFO [main] 2014-08-14 18:56:14,712 StorageService.java (line 1009) JOINING: sleeping 30000 ms for pending range setup INFO [GossipStage:1] 2014-08-14 18:56:15,106 Gossiper.java (line 910) Node /10.11.12.14 is now part of the cluster INFO [GossipStage:1] 2014-08-14 18:56:15,267 Gossiper.java (line 910) Node /10.11.12.13 is now part of the cluster INFO [main] 2014-08-14 18:56:44,713 StorageService.java (line 1009) JOINING: Starting to bootstrap... java.lang.IllegalStateException: unable to find sufficient sources for streaming range (1445230693272554509,1449885986247309687] INFO [HANDSHAKE-/10.11.12.13] 2014-08-14 18:56:45,571 OutboundTcpConnection.java (line 386) Handshaking version with /10.11.12.13 INFO [RequestResponseStage:5] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:2] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:7] 2014-08-14 18:56:45,580 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:1] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:8] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP Peter On Fri, Aug 15, 2014 at 3:36 AM, Mark Reddy <mark.re...@boxever.com> wrote: > Hi Peter, > > At the time of the IllegalStateException, do you see the node that it should > be streaming from marked as down by the failure detector? > > > Mark > > > On Fri, Aug 15, 2014 at 5:45 AM, Peter Haggerty <peter.hagge...@librato.com> > wrote: >> >> When adding nodes via bootstrap to a 27 node 2.0.9 cluster with a >> cluster-wide phi_convict_threshold of 12 the nodes fail to bootstrap. >> This worked a half dozen times in the past few weeks as we've scaled >> this cluster from 21 to 24 and then to 27 nodes. There have been no >> configuration or Cassandra version changes since the cluster was >> booted several weeks ago. This is adding a single node at a time just >> as we've done before. >> >> When using a phi of 16 on the node that is bootstrapping we've had one >> failure and one success. Adjusting RING_DELAY from 30 seconds to 45 >> doesn't appear to help. >> >> We have DEBUG level logs for a successful attempt and a failed attempt >> but they seem quite similar in behavior, even bad behavior like ERRORS >> reported by MigrationTask.java and FailureDetector.java. >> >> >> In the failed attempts the bootstrap immediately falls over with >> "unable to find ...": >> java.lang.IllegalStateException: unable to find sufficient sources for >> streaming range (1445230693272554509,1449885986247309687] >> >> The logs clearly show that it knows where those ranges can be found: >> DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122) >> Bootstrap: range (1445230693272554509,1449885986247309687] exists on >> /10.11.12.13 >> DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122) >> Bootstrap: range (1445230693272554509,1449885986247309687] exists on >> /10.11.12.14 >> >> If you look for the various non-debug log lines for these addresses >> you'll see some odd behavior, but you see similar odd behavior in the >> logs of the node that is able to successfully bootstrap. >> $ grep '10.11.12.13' system.log | grep -v ^DEBUG >> INFO [GossipStage:1] 2014-08-14 18:56:09,870 Gossiper.java (line 910) >> Node /10.11.12.13 is now part of the cluster >> INFO [GossipStage:1] 2014-08-14 18:56:15,267 Gossiper.java (line 910) >> Node /10.11.12.13 is now part of the cluster >> INFO [HANDSHAKE-/10.11.12.13] 2014-08-14 18:56:45,571 >> OutboundTcpConnection.java (line 386) Handshaking version with >> /10.11.12.13 >> INFO [RequestResponseStage:5] 2014-08-14 18:56:45,579 Gossiper.java >> (line 876) InetAddress /10.11.12.13 is now UP >> INFO [RequestResponseStage:2] 2014-08-14 18:56:45,579 Gossiper.java >> (line 876) InetAddress /10.11.12.13 is now UP >> INFO [RequestResponseStage:7] 2014-08-14 18:56:45,580 Gossiper.java >> (line 876) InetAddress /10.11.12.13 is now UP >> INFO [RequestResponseStage:1] 2014-08-14 18:56:45,579 Gossiper.java >> (line 876) InetAddress /10.11.12.13 is now UP >> INFO [RequestResponseStage:8] 2014-08-14 18:56:45,579 Gossiper.java >> (line 876) InetAddress /10.11.12.13 is now UP >> >> $ grep '10.11.12.14' system.log | grep -v ^DEBUG >> INFO [GossipStage:1] 2014-08-14 18:56:09,865 Gossiper.java (line 910) >> Node /10.11.12.14 is now part of the cluster >> INFO [GossipStage:1] 2014-08-14 18:56:15,106 Gossiper.java (line 910) >> Node /10.11.12.14 is now part of the cluster >> >> The general order of events, as told by JOINING lines: >> $ grep 'JOINING' system.log >> INFO [main] 2014-08-14 18:56:10,638 StorageService.java (line 1009) >> JOINING: waiting for ring information >> INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009) >> JOINING: schema complete, ready to bootstrap >> INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009) >> JOINING: waiting for pending range calculation >> INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009) >> JOINING: calculation complete, ready to bootstrap >> INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009) >> JOINING: getting bootstrap token >> INFO [main] 2014-08-14 18:56:14,712 StorageService.java (line 1009) >> JOINING: sleeping 30000 ms for pending range setup >> INFO [main] 2014-08-14 18:56:44,713 StorageService.java (line 1009) >> JOINING: Starting to bootstrap... >> >> >> Thanks in advance for any suggestions. >> >> >> Peter > >