I'll add that while we don't see either of the nodes with the range
that failed marked as down one of them *does* appear to have a late
arrival of the cluster of UP messages that happens just after the
bootstrap fails:

$ egrep 'JOIN|Illegal|10.11.12.14|10.11.12.13' system.log  | grep -v ^DEBUG
 INFO [GossipStage:1] 2014-08-14 18:56:09,865 Gossiper.java (line 910)
Node /10.11.12.14 is now part of the cluster
 INFO [GossipStage:1] 2014-08-14 18:56:09,870 Gossiper.java (line 910)
Node /10.11.12.13 is now part of the cluster
 INFO [main] 2014-08-14 18:56:10,638 StorageService.java (line 1009)
JOINING: waiting for ring information
 INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009)
JOINING: schema complete, ready to bootstrap
 INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009)
JOINING: waiting for pending range calculation
 INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009)
JOINING: calculation complete, ready to bootstrap
 INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009)
JOINING: getting bootstrap token
 INFO [main] 2014-08-14 18:56:14,712 StorageService.java (line 1009)
JOINING: sleeping 30000 ms for pending range setup
 INFO [GossipStage:1] 2014-08-14 18:56:15,106 Gossiper.java (line 910)
Node /10.11.12.14 is now part of the cluster
 INFO [GossipStage:1] 2014-08-14 18:56:15,267 Gossiper.java (line 910)
Node /10.11.12.13 is now part of the cluster
 INFO [main] 2014-08-14 18:56:44,713 StorageService.java (line 1009)
JOINING: Starting to bootstrap...
java.lang.IllegalStateException: unable to find sufficient sources for
streaming range (1445230693272554509,1449885986247309687]
 INFO [HANDSHAKE-/10.11.12.13] 2014-08-14 18:56:45,571
OutboundTcpConnection.java (line 386) Handshaking version with
/10.11.12.13
 INFO [RequestResponseStage:5] 2014-08-14 18:56:45,579 Gossiper.java
(line 876) InetAddress /10.11.12.13 is now UP
 INFO [RequestResponseStage:2] 2014-08-14 18:56:45,579 Gossiper.java
(line 876) InetAddress /10.11.12.13 is now UP
 INFO [RequestResponseStage:7] 2014-08-14 18:56:45,580 Gossiper.java
(line 876) InetAddress /10.11.12.13 is now UP
 INFO [RequestResponseStage:1] 2014-08-14 18:56:45,579 Gossiper.java
(line 876) InetAddress /10.11.12.13 is now UP
 INFO [RequestResponseStage:8] 2014-08-14 18:56:45,579 Gossiper.java
(line 876) InetAddress /10.11.12.13 is now UP


Peter



On Fri, Aug 15, 2014 at 3:36 AM, Mark Reddy <mark.re...@boxever.com> wrote:
> Hi Peter,
>
> At the time of the IllegalStateException, do you see the node that it should
> be streaming from marked as down by the failure detector?
>
>
> Mark
>
>
> On Fri, Aug 15, 2014 at 5:45 AM, Peter Haggerty <peter.hagge...@librato.com>
> wrote:
>>
>> When adding nodes via bootstrap to a 27 node 2.0.9 cluster with a
>> cluster-wide phi_convict_threshold of 12 the nodes fail to bootstrap.
>> This worked a half dozen times in the past few weeks as we've scaled
>> this cluster from 21 to 24 and then to 27 nodes. There have been no
>> configuration or Cassandra version changes since the cluster was
>> booted several weeks ago. This is adding a single node at a time just
>> as we've done before.
>>
>> When using a phi of 16 on the node that is bootstrapping we've had one
>> failure and one success. Adjusting RING_DELAY from 30 seconds to 45
>> doesn't appear to help.
>>
>> We have DEBUG level logs for a successful attempt and a failed attempt
>> but they seem quite similar in behavior, even bad behavior like ERRORS
>> reported by MigrationTask.java and FailureDetector.java.
>>
>>
>> In the failed attempts the bootstrap immediately falls over with
>> "unable to find ...":
>> java.lang.IllegalStateException: unable to find sufficient sources for
>> streaming range (1445230693272554509,1449885986247309687]
>>
>> The logs clearly show that it knows where those ranges can be found:
>> DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122)
>> Bootstrap: range (1445230693272554509,1449885986247309687] exists on
>> /10.11.12.13
>> DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122)
>> Bootstrap: range (1445230693272554509,1449885986247309687] exists on
>> /10.11.12.14
>>
>> If you look for the various non-debug log lines for these addresses
>> you'll see some odd behavior, but you see similar odd behavior in the
>> logs of the node that is able to successfully bootstrap.
>> $ grep '10.11.12.13' system.log | grep -v ^DEBUG
>>  INFO [GossipStage:1] 2014-08-14 18:56:09,870 Gossiper.java (line 910)
>> Node /10.11.12.13 is now part of the cluster
>>  INFO [GossipStage:1] 2014-08-14 18:56:15,267 Gossiper.java (line 910)
>> Node /10.11.12.13 is now part of the cluster
>>  INFO [HANDSHAKE-/10.11.12.13] 2014-08-14 18:56:45,571
>> OutboundTcpConnection.java (line 386) Handshaking version with
>> /10.11.12.13
>>  INFO [RequestResponseStage:5] 2014-08-14 18:56:45,579 Gossiper.java
>> (line 876) InetAddress /10.11.12.13 is now UP
>>  INFO [RequestResponseStage:2] 2014-08-14 18:56:45,579 Gossiper.java
>> (line 876) InetAddress /10.11.12.13 is now UP
>>  INFO [RequestResponseStage:7] 2014-08-14 18:56:45,580 Gossiper.java
>> (line 876) InetAddress /10.11.12.13 is now UP
>>  INFO [RequestResponseStage:1] 2014-08-14 18:56:45,579 Gossiper.java
>> (line 876) InetAddress /10.11.12.13 is now UP
>>  INFO [RequestResponseStage:8] 2014-08-14 18:56:45,579 Gossiper.java
>> (line 876) InetAddress /10.11.12.13 is now UP
>>
>> $ grep '10.11.12.14' system.log | grep -v ^DEBUG
>>  INFO [GossipStage:1] 2014-08-14 18:56:09,865 Gossiper.java (line 910)
>> Node /10.11.12.14 is now part of the cluster
>>  INFO [GossipStage:1] 2014-08-14 18:56:15,106 Gossiper.java (line 910)
>> Node /10.11.12.14 is now part of the cluster
>>
>> The general order of events, as told by JOINING lines:
>> $ grep 'JOINING' system.log
>>  INFO [main] 2014-08-14 18:56:10,638 StorageService.java (line 1009)
>> JOINING: waiting for ring information
>>  INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009)
>> JOINING: schema complete, ready to bootstrap
>>  INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009)
>> JOINING: waiting for pending range calculation
>>  INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009)
>> JOINING: calculation complete, ready to bootstrap
>>  INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009)
>> JOINING: getting bootstrap token
>>  INFO [main] 2014-08-14 18:56:14,712 StorageService.java (line 1009)
>> JOINING: sleeping 30000 ms for pending range setup
>>  INFO [main] 2014-08-14 18:56:44,713 StorageService.java (line 1009)
>> JOINING: Starting to bootstrap...
>>
>>
>> Thanks in advance for any suggestions.
>>
>>
>> Peter
>
>

Reply via email to