Neither of the two nodes identified as having the range that IllegalStateException reports are mentioned by FailureDetector.java.
There are 5 endpoints that FailureDetector says are 'unknown endpoint' but all of them are reported as "UP" by Gossiper.java before the "schema complete, ready to bootstrap" line, with one exception that lags just a little and gets marked as "UP" by Gossiper just before the 30 second ring delay is logged. In short everything seems to be up *before* bootstrap actually starts. One thing FailureDetector.java does report is about 150 "Ignoring interval time ..." messages pretty much during the entire process. Thanks for your suggestion! Are there any other things we should check? Peter On Fri, Aug 15, 2014 at 3:36 AM, Mark Reddy <mark.re...@boxever.com> wrote: > Hi Peter, > > At the time of the IllegalStateException, do you see the node that it should > be streaming from marked as down by the failure detector? > > > Mark > > > On Fri, Aug 15, 2014 at 5:45 AM, Peter Haggerty <peter.hagge...@librato.com> > wrote: >> >> When adding nodes via bootstrap to a 27 node 2.0.9 cluster with a >> cluster-wide phi_convict_threshold of 12 the nodes fail to bootstrap. >> This worked a half dozen times in the past few weeks as we've scaled >> this cluster from 21 to 24 and then to 27 nodes. There have been no >> configuration or Cassandra version changes since the cluster was >> booted several weeks ago. This is adding a single node at a time just >> as we've done before. >> >> When using a phi of 16 on the node that is bootstrapping we've had one >> failure and one success. Adjusting RING_DELAY from 30 seconds to 45 >> doesn't appear to help. >> >> We have DEBUG level logs for a successful attempt and a failed attempt >> but they seem quite similar in behavior, even bad behavior like ERRORS >> reported by MigrationTask.java and FailureDetector.java. >> >> >> In the failed attempts the bootstrap immediately falls over with >> "unable to find ...": >> java.lang.IllegalStateException: unable to find sufficient sources for >> streaming range (1445230693272554509,1449885986247309687] >> >> The logs clearly show that it knows where those ranges can be found: >> DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122) >> Bootstrap: range (1445230693272554509,1449885986247309687] exists on >> /10.11.12.13 >> DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122) >> Bootstrap: range (1445230693272554509,1449885986247309687] exists on >> /10.11.12.14 >> >> If you look for the various non-debug log lines for these addresses >> you'll see some odd behavior, but you see similar odd behavior in the >> logs of the node that is able to successfully bootstrap. >> $ grep '10.11.12.13' system.log | grep -v ^DEBUG >> INFO [GossipStage:1] 2014-08-14 18:56:09,870 Gossiper.java (line 910) >> Node /10.11.12.13 is now part of the cluster >> INFO [GossipStage:1] 2014-08-14 18:56:15,267 Gossiper.java (line 910) >> Node /10.11.12.13 is now part of the cluster >> INFO [HANDSHAKE-/10.11.12.13] 2014-08-14 18:56:45,571 >> OutboundTcpConnection.java (line 386) Handshaking version with >> /10.11.12.13 >> INFO [RequestResponseStage:5] 2014-08-14 18:56:45,579 Gossiper.java >> (line 876) InetAddress /10.11.12.13 is now UP >> INFO [RequestResponseStage:2] 2014-08-14 18:56:45,579 Gossiper.java >> (line 876) InetAddress /10.11.12.13 is now UP >> INFO [RequestResponseStage:7] 2014-08-14 18:56:45,580 Gossiper.java >> (line 876) InetAddress /10.11.12.13 is now UP >> INFO [RequestResponseStage:1] 2014-08-14 18:56:45,579 Gossiper.java >> (line 876) InetAddress /10.11.12.13 is now UP >> INFO [RequestResponseStage:8] 2014-08-14 18:56:45,579 Gossiper.java >> (line 876) InetAddress /10.11.12.13 is now UP >> >> $ grep '10.11.12.14' system.log | grep -v ^DEBUG >> INFO [GossipStage:1] 2014-08-14 18:56:09,865 Gossiper.java (line 910) >> Node /10.11.12.14 is now part of the cluster >> INFO [GossipStage:1] 2014-08-14 18:56:15,106 Gossiper.java (line 910) >> Node /10.11.12.14 is now part of the cluster >> >> The general order of events, as told by JOINING lines: >> $ grep 'JOINING' system.log >> INFO [main] 2014-08-14 18:56:10,638 StorageService.java (line 1009) >> JOINING: waiting for ring information >> INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009) >> JOINING: schema complete, ready to bootstrap >> INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009) >> JOINING: waiting for pending range calculation >> INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009) >> JOINING: calculation complete, ready to bootstrap >> INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009) >> JOINING: getting bootstrap token >> INFO [main] 2014-08-14 18:56:14,712 StorageService.java (line 1009) >> JOINING: sleeping 30000 ms for pending range setup >> INFO [main] 2014-08-14 18:56:44,713 StorageService.java (line 1009) >> JOINING: Starting to bootstrap... >> >> >> Thanks in advance for any suggestions. >> >> >> Peter > >