Hey Aaron,

> Are there any log messages in the existing nodes or the new one which
> mention each other?

From the currently running nodes we get the message that the new node is up:

INFO [GMFD:1] 2010-10-22 10:22:49,232 Gossiper.java (line 591) Node /192.168.2.18 is now part of the cluster
INFO [GMFD:1] 2010-10-22 10:22:49,616 Gossiper.java (line 583) InetAddress /192.168.2.18 is now UP
INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 10:22:49,629 HintedHandOffManager.java (line 172) Started hinted handoff for endPoint /192.168.2.18
INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 10:22:49,677 HintedHandOffManager.java (line 229) Finished hinted handoff of 0 rows to endpoint /192.168.2.18
INFO [STREAM-STAGE:1] 2010-10-22 10:25:12,058 StreamOut.java (line 132) Sending a stream initiate message to /192.168.2.18 ...
INFO [STREAM-STAGE:1] 2010-10-22 10:25:12,058 StreamOut.java (line 137) Waiting for transfer to /192.168.2.18 to complete

Likewise, .18 sees the other nodes, with log messages such as:

INFO [GMFD:1] 2010-10-22 13:45:17,729 Gossiper.java (line 597) Node /192.168.2.23 has restarted, now UP again
INFO [GMFD:1] 2010-10-22 14:06:04,342 Gossiper.java (line 597) Node /192.168.2.20 has restarted, now UP again
INFO [GMFD:1] 2010-10-22 15:22:08,326 Gossiper.java (line 597) Node /192.168.2.21 has restarted, now UP again
INFO [GMFD:1] 2010-10-22 17:31:01,819 Gossiper.java (line 597) Node /192.168.2.22 has restarted, now UP again

> Is this a production system? Is it still running ?

Yep, it is a production system and it's still up.

> You'll need to dig through the logs a bit more to see that the
> bootstrapping started and what was the last message it logged.

Just going to dump some log here in case you see anything that stands out:

.....
INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 10:19:33,479 HintedHandOffManager.java (line 172) Started hinted handoff for endPoint /192.168.2.21
INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 10:19:33,480 HintedHandOffManager.java (line 229) Finished hinted handoff of 0 rows to endpoint /192.168.2.21
INFO [main] 2010-10-22 10:20:58,700 StorageService.java (line 391) Joining: getting bootstrap token
INFO [main] 2010-10-22 10:20:58,756 BootStrapper.java (line 108) New token will be 131382354792524396082927748799616801083 to assume load from /192.168.2.21
INFO [main] 2010-10-22 10:20:58,757 StorageService.java (line 391) Joining: sleeping 30000 ms for pending range setup
INFO [main] 2010-10-22 10:21:28,757 StorageService.java (line 391) Bootstrapping
....
INFO [Thread-84] 2010-10-22 13:42:13,136 SSTableReader.java (line 125) Sampling index and loading saved keyCache for /var/data....
INFO [Thread-84] 2010-10-22 13:42:13,182 StreamCompletionHandler.java (line 64) Streaming added /var/data/....
....
INFO [SSTABLE-CLEANUP-TIMER] 2010-10-22 13:43:11,568 SSTableDeletingReference.java (line 107) Deleted /var/data/...
INFO [WRITE-/192.168.2.23] 2010-10-22 13:43:38,670 OutboundTcpConnection.java (line 103) error writing to /192.168.2.23
INFO [Timer-0] 2010-10-22 13:43:40,670 Gossiper.java (line 180) InetAddress /192.168.2.23 is now dead.
INFO [GMFD:1] 2010-10-22 13:45:17,729 Gossiper.java (line 597) Node /192.168.2.23 has restarted, now UP again
INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 13:45:17,730 HintedHandOffManager.java (line 172) Started hinted handoff for endPoint /192.168.2.23
INFO [GMFD:1] 2010-10-22 13:45:17,730 StorageService.java (line 569) Node /192.168.2.23 state jump to normal
INFO [GMFD:1] 2010-10-22 13:45:17,731 StorageService.java (line 576) Will not change my token ownership to /192.168.2.23
INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 13:45:17,879 HintedHandOffManager.java (line 229) Finished hinted handoff of 0 rows to endpoint /192.168.2.23
INFO [Timer-0] 2010-10-22 14:04:25,722 Gossiper.java (line 180) InetAddress /192.168.2.20 is now dead.
INFO [GMFD:1] 2010-10-22 14:06:04,342 Gossiper.java (line 597) Node /192.168.2.20 has restarted, now UP again
INFO [GMFD:1] 2010-10-22 14:06:04,342 StorageService.java (line 569) Node /192.168.2.20 state jump to normal
INFO [GMFD:1] 2010-10-22 14:06:04,343 StorageService.java (line 576) Will not change my token ownership to /192.168.2.20
INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 14:06:04,344 HintedHandOffManager.java (line 172) Started hinted handoff for endPoint /192.168.2.20
INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 14:06:04,579 ColumnFamilyStore.java (line 470) HintsColumnFamily has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/data/cassandra/commitlog/CommitLog-1287747593217.log', position=9825007)
....

Thanks for the help, and pardon the log file vomit above.

Dimitry.

On Thu, Oct 28, 2010 at 12:44 PM, aaron morton <aa...@thelastpickle.com> wrote:

> The best approach is to manually select the tokens, see the Load Balancing
> section http://wiki.apache.org/cassandra/Operations
>
> Also
>
> Are there any log messages in the existing nodes or the new one which
> mention each other?
>
> Is this a production system? Is it still running ?
>
> Sorry there is not a lot to go on, it sounds like you've done the right
> thing. I'm assuming things like the Cluster Name, seed list and port numbers
> are set correctly as the new node got some data.
>
> You'll need to dig through the logs a bit more to see that the
> bootstrapping started and what was the last message it logged.
>
> Good Luck.
> Aaron
>
> On 27 Oct 2010, at 22:40, Dimitry Lvovsky wrote:
>
> Hi Aaron,
> Thanks for your reply.
>
> We still haven't solved this unfortunately.
>
>> How did you start the bootstrap for the .18 node ?
>
> Standard way: we set "AutoBootstrap" to true and added all the servers from
> the working ring as seeds.
>
>> Was it the .18 or the .17 node you tried to add
>
> We first tried adding .17. It streamed for a while, took on 50GB of load,
> stopped streaming, but then didn't enter the ring. We left it for a few
> days to see if it would come in, but no luck. After that we did
> decommission and removeToken (in that order) operations.
> Since we couldn't get .17 in we tried again with .18. Before doing so we
> increased the RpcTimeoutInMillis from 1000 to 10000, having read that this
> setting may cause the problem of nodes not entering the ring. It's been going
> since Friday and still, like .17, won't come into the ring.
>
>> Does it have a token in the config or did you use nodetool move to set it
>
> No, we didn't manually set the token in the config; rather, we were relying
> on the token to be assigned during bootstrap from the RandomPartitioner.
>
> Again thanks for the help.
>
> Dimitry.
>
>
>
> On Tue, Oct 26, 2010 at 10:14 PM, Aaron Morton <aa...@thelastpickle.com> wrote:
>
>> Dimitry, Did you get anywhere with this ?
>>
>> Was it the .18 or the .17 node you tried to add ? How did you start the
>> bootstrap for the .18 node ? Does it have a token in the config or did you
>> use nodetool move to set it?
>>
>> I had a quick look at the code. AFAIK the message about removing the fat
>> client is logged when the node does not have a record of the token the other
>> node has.
>>
>> Aaron
>>
>> On 26 Oct, 2010, at 10:42 PM, Dimitry Lvovsky <dimi...@reviewpro.com>
>> wrote:
>>
>> Hi All,
>> We recently upgraded from .65 to .66 after which we tried adding a new
>> node to our cluster. We left it bootstrapping and after 3 days, it still
>> refused to join the ring. The strange thing is that nodetool info shows 50GB
>> of load and nodetool ring shows that it sees the rest of the ring, which it is
>> not part of. We tried the process again with another server -- again the
>> same thing as before:
>>
>>
>> //from machine 192.168.2.18
>>
>>
>> /opt/cassandra/bin/nodetool -h localhost -p 8999 info
>> 131373516047318302934572185119435768941
>> Load : 52.85 GB
>> Generation No : 1287761987
>> Uptime (seconds) : 323157
>> Heap Memory (MB) : 795.42 / 1945.63
>>
>>
>> /opt/cassandra/bin/nodetool -h localhost -p 8999 ring
>> Address       Status     Load          Range                                      Ring
>>                                        158573510920250391466717289405976537674
>> 192.168.2.22  Up         59.45 GB      28203205416427384773583427414698832202     |<--|
>> 192.168.2.23  Up         44.95 GB      60562227403709245514637766500430120055     |   |
>> 192.168.2.20  Up         47.15 GB      104160057322065544623939416372654814065    |   |
>> 192.168.2.21  Up         61.04 GB      158573510920250391466717289405976537674    |-->|
>>
>> /opt/cassandra/bin/nodetool -h localhost -p 8999 streams
>> Mode: Bootstrapping
>> Not sending any streams.
>> Not receiving any streams.
>>
>>
>> What's more, while looking at the log of one of the nodes I see gossip
>> messages from 192.168.2.17 -- the first node we tried to add to the cluster
>> but which is not running at the time of the log message:
>> INFO [Timer-0] 2010-10-26 02:13:20,340 Gossiper.java (line 406) FatClient /192.168.2.17 has been silent for 3600000ms, removing from gossip
>> INFO [GMFD:1] 2010-10-26 02:13:51,398 Gossiper.java (line 591) Node /192.168.2.17 is now part of the cluster
>>
>>
>> Thanks in advance for the help,
>> Dimitry
>>
>>
>
>
> --
> Dimitry Lvovsky
> Director of Engineering
> ReviewPro
> www.reviewpro.com
> +34 616 337 103
>
>
>

--
Dimitry Lvovsky
Director of Engineering
ReviewPro
www.reviewpro.com
+34 616 337 103
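
For reference, the manual token selection suggested in the quoted reply (the Load Balancing section of the Operations wiki page) can be sketched roughly as below. This is only a minimal sketch, assuming RandomPartitioner's token space runs from 0 to 2**127 and that the goal is evenly spaced tokens. The helper name balanced_tokens and the target of five nodes (the four existing nodes plus the new one) are illustrative assumptions, not something taken from Cassandra or from this thread.

```python
# Sketch: evenly spaced tokens for a RandomPartitioner ring.
# Assumption: the token space is 0 .. 2**127, so node i of n
# is assigned i * 2**127 // n.

RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return node_count evenly spaced tokens (hypothetical helper)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Five nodes is an assumed target: four existing nodes plus the new one.
    for index, token in enumerate(balanced_tokens(5)):
        print(index, token)
```

Each printed value would then be given to one node, either as the token in the config or via nodetool move, as discussed above. Note that the existing ring shown earlier is not evenly spaced, so reaching a balanced layout would likely mean moving existing nodes as well as placing the new one.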