We didn't solve it, unfortunately, and ended up regenerating the entire cluster. But, if it helps anyone in the future, we too had multiple keyspaces when we encountered the problem.
On Mon, Nov 8, 2010 at 5:47 PM, Marc Canaleta <mcanal...@gmail.com> wrote:

> I have just solved the problem by removing the second keyspace (manually
> moving its column families to the first). So it seems the problem appears
> when having multiple keyspaces.
>
> 2010/11/8 Thibaut Britz <thibaut.br...@trendiction.com>
>
>> Hi,
>>
>> No, I didn't solve the problem. I reinitialized the cluster and gave each
>> node a token manually before adding data. There are a few messages in
>> multiple threads related to this, so I suspect it's very common, and I hope
>> it's gone with 0.7.
>>
>> Thibaut
>>
>> On Sun, Nov 7, 2010 at 6:57 PM, Marc Canaleta <mcanal...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Did you solve this problem? I'm having the same problem. I'm trying to
>>> bootstrap a third node in a 0.6.6 cluster. It has two keyspaces: Keyspace1
>>> and KeyspaceLogs, both with replication factor 2.
>>>
>>> It starts bootstrapping and receives some streams, but then it keeps
>>> waiting for streams. I enabled debug mode. These lines may be useful:
>>>
>>> DEBUG [main] 2010-11-07 17:39:50,052 BootStrapper.java (line 70) Beginning bootstrap process
>>> DEBUG [main] 2010-11-07 17:39:50,082 StorageService.java (line 160) Added /10.204.93.16/Keyspace1 as a bootstrap source
>>> ...
>>> DEBUG [main] 2010-11-07 17:39:50,090 StorageService.java (line 160) Added /10.204.93.16/KeyspaceLogs as a bootstrap source
>>> ... (streaming messages)
>>> DEBUG [Thread-56] 2010-11-07 17:45:51,706 StorageService.java (line 171) Removed /10.204.93.16/Keyspace1 as a bootstrap source; remaining is [/10.204.93.16]
>>> ...
>>> (and it never ends).
>>>
>>> It seems it is waiting for [/10.204.93.16] when it should be waiting
>>> for /10.204.93.16/KeyspaceLogs.
>>>
>>> The third node is 64-bit, while the two existing nodes are 32-bit. Can
>>> this be a problem?
>>>
>>> Thank you.
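Marc's reading of the debug output above can be illustrated with a small sketch. This is not the actual Cassandra 0.6 code, just a hypothetical model of the bookkeeping mismatch he describes: bootstrap sources are tracked per endpoint/keyspace, but if one entry ends up keyed by the bare endpoint, the per-keyspace removal on stream completion never matches it, and bootstrap waits forever:

```python
# Hypothetical model (not Cassandra source) of the mismatch Marc describes:
# sources should be tracked as "endpoint/keyspace", but one entry is
# recorded with the bare endpoint, so stream completion never clears it.
pending = {"/10.204.93.16/Keyspace1", "/10.204.93.16"}

def stream_completed(source):
    # Bootstrap finishes only once this set is empty.
    pending.discard(source)

stream_completed("/10.204.93.16/Keyspace1")     # matches, removed
stream_completed("/10.204.93.16/KeyspaceLogs")  # no match: stored key lacks the keyspace suffix
print(pending)  # the bare-endpoint entry lingers, so the node waits forever
```

Under that assumption, the node ends up blocked on a source it can never be told has finished, which matches the "remaining is [/10.204.93.16]" log line.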
>>>
>>> 2010/10/28 Dimitry Lvovsky <dimi...@reviewpro.com>
>>>
>>>> Maybe your <StoragePort>7000</StoragePort> is being blocked by
>>>> iptables or some firewall, or maybe you have it bound (<ListenAddress> tag)
>>>> to localhost instead of an IP address.
>>>>
>>>> Hope this helps,
>>>> Dimitry.
>>>>
>>>> On Thu, Oct 28, 2010 at 5:35 PM, Thibaut Britz <thibaut.br...@trendiction.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have the same problem with 0.6.5.
>>>>>
>>>>> New nodes will hang forever in bootstrap mode (no streams are being
>>>>> opened) and the receiver thread just waits for data forever:
>>>>>
>>>>> INFO [Thread-53] 2010-10-27 20:33:37,399 SSTableReader.java (line 120) Sampling index for /hd2/cassandra/data/table_xyz/table_xyz-3-Data.db
>>>>> INFO [Thread-53] 2010-10-27 20:33:37,444 StreamCompletionHandler.java (line 64) Streaming added /hd2/cassandra/data/table_xyz/table_xyz-3-Data.db
>>>>>
>>>>> Stack trace:
>>>>>
>>>>> "pool-1-thread-53" prio=10 tid=0x00000000412f2800 nid=0x215c runnable [0x00007fd7cf217000]
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>>>>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>>>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>>         - locked <0x00007fd7e77e0520> (a java.io.BufferedInputStream)
>>>>>         at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:126)
>>>>>         at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>>>>>         at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:314)
>>>>>         at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:262)
>>>>>         at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:192)
>>>>>         at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1154)
>>>>>         at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>>
>>>>> On Thu, Oct 28, 2010 at 12:44 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>>>>
>>>>>> The best approach is to manually select the tokens; see the Load
>>>>>> Balancing section of http://wiki.apache.org/cassandra/Operations
>>>>>>
>>>>>> Also, are there any log messages in the existing nodes or the new one
>>>>>> which mention each other?
>>>>>>
>>>>>> Is this a production system? Is it still running?
>>>>>>
>>>>>> Sorry, there is not a lot to go on; it sounds like you've done the
>>>>>> right thing. I'm assuming things like the Cluster Name, seed list and port
>>>>>> numbers are set correctly, as the new node got some data.
>>>>>>
>>>>>> You'll need to dig through the logs a bit more to see that the
>>>>>> bootstrapping started and what was the last message it logged.
>>>>>>
>>>>>> Good luck.
>>>>>> Aaron
>>>>>>
>>>>>> On 27 Oct 2010, at 22:40, Dimitry Lvovsky wrote:
>>>>>>
>>>>>> Hi Aaron,
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> We still haven't solved this, unfortunately.
>>>>>>
>>>>>>> How did you start the bootstrap for the .18 node?
>>>>>>
>>>>>> Standard way: we set "AutoBootstrap" to true and added all the servers
>>>>>> from the working ring as seeds.
>>>>>>
>>>>>>> Was it the .18 or the .17 node you tried to add?
>>>>>>
>>>>>> We first tried adding .17: it streamed for a while, took on 50GB of
>>>>>> load, stopped streaming, but then didn't enter the ring. We left it for
>>>>>> a few days to see if it would come in, but no luck. After that we did
>>>>>> decommission and removeToken (in that order) operations.
>>>>>> Since we couldn't get .17 in, we tried again with .18. Before doing so,
>>>>>> we increased RpcTimeoutInMillis from 1000 to 10000, having read that
>>>>>> this may cause the problem of nodes not entering the ring. It's been
>>>>>> going since Friday and still, like .17, won't come into the ring.
>>>>>>
>>>>>>> Does it have a token in the config, or did you use nodetool move to set it?
>>>>>>
>>>>>> No, we didn't manually set the token in the config; rather, we were
>>>>>> relying on the token being assigned during bootstrap by the
>>>>>> RandomPartitioner.
>>>>>>
>>>>>> Again, thanks for the help.
>>>>>>
>>>>>> Dimitry.
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 10:14 PM, Aaron Morton <aa...@thelastpickle.com> wrote:
>>>>>>
>>>>>>> Dimitry, did you get anywhere with this?
>>>>>>>
>>>>>>> Was it the .18 or the .17 node you tried to add? How did you start
>>>>>>> the bootstrap for the .18 node? Does it have a token in the config, or did
>>>>>>> you use nodetool move to set it?
>>>>>>>
>>>>>>> I had a quick look at the code; AFAIK the message about removing the
>>>>>>> fat client is logged when the node does not have a record of the token the
>>>>>>> other node has.
>>>>>>>
>>>>>>> Aaron
>>>>>>>
>>>>>>> On 26 Oct, 2010, at 10:42 PM, Dimitry Lvovsky <dimi...@reviewpro.com> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>> We recently upgraded from 0.6.5 to 0.6.6, after which we tried adding
>>>>>>> a new node to our cluster. We left it bootstrapping, and after 3 days
>>>>>>> it still refused to join the ring.
>>>>>>> The strange thing is that nodetool info shows 50GB of load,
>>>>>>> and nodetool ring shows that it sees the rest of the ring, which it
>>>>>>> is not part of. We tried the process again with another server -- again
>>>>>>> the same thing as before:
>>>>>>>
>>>>>>> // from machine 192.168.2.18
>>>>>>>
>>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 info
>>>>>>> 131373516047318302934572185119435768941
>>>>>>> Load             : 52.85 GB
>>>>>>> Generation No    : 1287761987
>>>>>>> Uptime (seconds) : 323157
>>>>>>> Heap Memory (MB) : 795.42 / 1945.63
>>>>>>>
>>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 ring
>>>>>>> Address       Status  Load      Range                                     Ring
>>>>>>>                                 158573510920250391466717289405976537674
>>>>>>> 192.168.2.22  Up      59.45 GB  28203205416427384773583427414698832202   |<--|
>>>>>>> 192.168.2.23  Up      44.95 GB  60562227403709245514637766500430120055   |   |
>>>>>>> 192.168.2.20  Up      47.15 GB  104160057322065544623939416372654814065  |   |
>>>>>>> 192.168.2.21  Up      61.04 GB  158573510920250391466717289405976537674  |-->|
>>>>>>>
>>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 streams
>>>>>>> Mode: Bootstrapping
>>>>>>> Not sending any streams.
>>>>>>> Not receiving any streams.
>>>>>>>
>>>>>>> What's more, while looking at the log of one of the nodes, I see gossip
>>>>>>> messages from 192.168.2.17 -- the first node we tried to add to the
>>>>>>> cluster, but which is not running at the time of the log message:
>>>>>>>
>>>>>>> INFO [Timer-0] 2010-10-26 02:13:20,340 Gossiper.java (line 406) FatClient /192.168.2.17 has been silent for 3600000ms, removing from gossip
>>>>>>> INFO [GMFD:1] 2010-10-26 02:13:51,398 Gossiper.java (line 591) Node /192.168.2.17 is now part of the cluster
>>>>>>>
>>>>>>> Thanks in advance for the help,
>>>>>>> Dimitry
>>>>>>
>>>>>> --
>>>>>> Dimitry Lvovsky
>>>>>> Director of Engineering
>>>>>> ReviewPro
>>>>>> www.reviewpro.com
>>>>>> +34 616 337 103

--
Dimitry Lvovsky
Director of Engineering
ReviewPro
www.reviewpro.com
+34 616 337 103
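Several replies in the thread recommend assigning each node a token by hand before it starts, rather than letting bootstrap pick one. As a minimal sketch, assuming the RandomPartitioner's token space of 0 to 2**127 - 1 (the helper name `balanced_tokens` is our own, not a Cassandra API), evenly spaced tokens for an N-node ring can be computed like this:

```python
# Sketch: evenly spaced tokens for a RandomPartitioner ring.
# Assumes the 0.6-era token space of 0 .. 2**127 - 1.
def balanced_tokens(node_count):
    spacing = 2**127 // node_count
    return [i * spacing for i in range(node_count)]

for node, token in enumerate(balanced_tokens(4)):
    print(f"node {node}: {token}")
```

Each value would then be set as that node's InitialToken (in storage-conf.xml on 0.6) or applied with nodetool move before the node joins, as the wiki's Load Balancing section describes.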