Re: New nodes won't bootstrap on .66

Marc Canaleta Mon, 08 Nov 2010 08:47:47 -0800

I have just solved the problem by removing the second keyspace (manually moving
its column families into the first). So the problem seems to appear when there
are multiple keyspaces.


2010/11/8 Thibaut Britz <thibaut.br...@trendiction.com>

> Hi,
>
> No, I didn't solve the problem. I reinitialized the cluster and manually
> gave each node a token before adding data. There are several messages about
> this in multiple threads, so I suspect it's quite common; I hope it's fixed
> in 0.7.
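Manually assigning tokens, as described above (and as the Load Balancing section of the Operations wiki page recommends), usually means spacing them evenly around the RandomPartitioner token space, which runs from 0 to 2**127. The following is a minimal Python sketch that assumes the full node count is known up front. The function name `balanced_tokens` is hypothetical, and each printed value is assumed to go into that node's InitialToken setting before the node first starts.

```python
# Evenly spaced initial tokens for RandomPartitioner (token space 0 .. 2**127).
# Sketch only: assumes every node in the planned cluster is known in advance.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count: int) -> list[int]:
    """Return one token per node, spaced evenly around the ring."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    step = RING_SIZE // node_count
    return [i * step for i in range(node_count)]


if __name__ == "__main__":
    # Print one token per node for a three-node cluster.
    for index, token in enumerate(balanced_tokens(3)):
        print(f"node {index}: {token}")
```

Adding nodes later with this scheme means either recomputing tokens for the new count and moving existing nodes, or placing new nodes in the middle of existing ranges.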
>
> Thibaut
>
> On Sun, Nov 7, 2010 at 6:57 PM, Marc Canaleta <mcanal...@gmail.com> wrote:
>
>> Hi,
>>
>> Did you solve this problem? I'm having the same problem. I'm trying to
>> bootstrap a third node in a 0.6.6 cluster. It has two keyspaces: Keyspace1
>> and KeyspaceLogs, both with a replication factor of 2.
>>
>> It starts bootstrapping and receives some streams, but then it keeps waiting
>> for more. I enabled debug logging; these lines may be useful:
>>
>> DEBUG [main] 2010-11-07 17:39:50,052 BootStrapper.java (line 70) Beginning bootstrap process
>> DEBUG [main] 2010-11-07 17:39:50,082 StorageService.java (line 160) Added /10.204.93.16/Keyspace1 as a bootstrap source
>> ...
>> DEBUG [main] 2010-11-07 17:39:50,090 StorageService.java (line 160) Added /10.204.93.16/KeyspaceLogs as a bootstrap source
>> ... (streaming messages)
>> DEBUG [Thread-56] 2010-11-07 17:45:51,706 StorageService.java (line 171) Removed /10.204.93.16/Keyspace1 as a bootstrap source; remaining is [/10.204.93.16]
>> ...
>> (and it never ends).
>>
>> It seems to be waiting for [/10.204.93.16] when it should be waiting for
>> /10.204.93.16/KeyspaceLogs.
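One way to see which bootstrap sources were never released is to replay the "Added ... as a bootstrap source" and "Removed ... as a bootstrap source" DEBUG lines from the new node's log, instead of relying on the "remaining is [...]" summary. Below is a minimal Python sketch. It assumes each log entry sits on a single line (as in the raw log file, not wrapped as in this email) and that the message wording matches the lines quoted above. `pending_bootstrap_sources` is a hypothetical helper, not part of Cassandra.

```python
import re

# Matches the StorageService DEBUG messages quoted above, e.g.
# "Added /10.204.93.16/Keyspace1 as a bootstrap source".
SOURCE_RE = re.compile(r"\b(Added|Removed) (\S+) as a bootstrap source")


def pending_bootstrap_sources(log_lines):
    """Return sources that were added but never removed, in the order added."""
    pending = {}  # dict preserves insertion order
    for line in log_lines:
        match = SOURCE_RE.search(line)
        if match is None:
            continue
        action, source = match.groups()
        if action == "Added":
            pending[source] = True
        else:
            pending.pop(source, None)
    return list(pending)
```

Any iterable of lines works, so an open log file can be passed in directly. Applied to the three lines quoted above, this reports /10.204.93.16/KeyspaceLogs as still pending.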
>>
>> The third node is 64-bit, while the two existing nodes are 32-bit. Could
>> this be a problem?
>>
>> Thank you.
>>
>>
>> 2010/10/28 Dimitry Lvovsky <dimi...@reviewpro.com>
>>
>>> Maybe your <StoragePort>7000</StoragePort> is being blocked by iptables
>>> or some other firewall, or maybe you have it bound (the <ListenAddress> tag)
>>> to localhost instead of an IP address.
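To rule out a firewall or a localhost-bound ListenAddress, you can check from the new node whether each existing node accepts TCP connections on its storage port. Below is a minimal Python sketch that uses only the standard library. The function name is hypothetical, 7000 is the StoragePort value quoted above, and a successful connect only shows that the port is reachable, not that gossip or streaming is healthy.

```python
import socket


def storage_port_reachable(host, port=7000, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks and DNS errors.
        return False
```

Run it from the bootstrapping node against each existing ring member, e.g. `storage_port_reachable("192.168.2.20")`. Then run it in the opposite direction as well, because streaming needs connections both ways.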
>>>
>>> Hope this helps,
>>> Dimitry.
>>>
>>>
>>>
>>> On Thu, Oct 28, 2010 at 5:35 PM, Thibaut Britz <thibaut.br...@trendiction.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have the same problem with 0.6.5.
>>>>
>>>> New nodes hang forever in bootstrap mode (no streams are opened), and the
>>>> receiver thread just waits for data forever:
>>>>
>>>>
>>>>  INFO [Thread-53] 2010-10-27 20:33:37,399 SSTableReader.java (line 120) Sampling index for /hd2/cassandra/data/table_xyz/table_xyz-3-Data.db
>>>>  INFO [Thread-53] 2010-10-27 20:33:37,444 StreamCompletionHandler.java (line 64) Streaming added /hd2/cassandra/data/table_xyz/table_xyz-3-Data.db
>>>>
>>>> Stack trace:
>>>>
>>>> "pool-1-thread-53" prio=10 tid=0x00000000412f2800 nid=0x215c runnable [0x00007fd7cf217000]
>>>>    java.lang.Thread.State: RUNNABLE
>>>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>>>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>>>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>         - locked <0x00007fd7e77e0520> (a java.io.BufferedInputStream)
>>>>         at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:126)
>>>>         at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>>>>         at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:314)
>>>>         at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:262)
>>>>         at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:192)
>>>>         at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1154)
>>>>         at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>
>>>> On Thu, Oct 28, 2010 at 12:44 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>>>
>>>>> The best approach is to select the tokens manually; see the Load
>>>>> Balancing section of http://wiki.apache.org/cassandra/Operations. Also:
>>>>>
>>>>> Are there any log messages on the existing nodes or the new one that
>>>>> mention each other?
>>>>>
>>>>> Is this a production system? Is it still running?
>>>>>
>>>>> Sorry, there is not a lot to go on; it sounds like you've done the right
>>>>> thing. I'm assuming things like the Cluster Name, seed list and port
>>>>> numbers are set correctly, since the new node got some data.
>>>>>
>>>>> You'll need to dig through the logs a bit more to confirm that
>>>>> bootstrapping started and to find the last message it logged.
>>>>>
>>>>> Good Luck.
>>>>> Aaron
>>>>>
>>>>> On 27 Oct 2010, at 22:40, Dimitry Lvovsky wrote:
>>>>>
>>>>> Hi Aaron,
>>>>> Thanks for your reply.
>>>>>
>>>>> Unfortunately, we still haven't solved this.
>>>>>
>>>>>> How did you start the bootstrap for the .18 node?
>>>>>
>>>>>
>>>>> Standard way: we set "AutoBootstrap" to true and added all the servers
>>>>> from the working ring as seeds.
>>>>>
>>>>>
>>>>>> Was it the .18 or the .17 node you tried to add
>>>>>
>>>>>
>>>>> We first tried adding .17. It streamed for a while, took on about 50GB of
>>>>> load and stopped streaming, but then didn't join the ring. We left it for
>>>>> a few days to see if it would come in, but no luck. After that we ran the
>>>>> decommission and removeToken operations (in that order).
>>>>> Since we couldn't get .17 in, we tried again with .18. Before doing so
>>>>> we increased RpcTimeoutInMillis from 1000 to 10000, having read that a
>>>>> low timeout may keep nodes from joining the ring. It has been running
>>>>> since Friday and, like .17, still won't join the ring.
>>>>>
>>>>>> Does it have a token in the config or did you use nodetool move to set it?
>>>>>
>>>>> No, we didn't set the token manually in the config; rather, we were
>>>>> relying on a token being assigned during bootstrap by the
>>>>> RandomPartitioner.
>>>>>
>>>>> Again thanks for the help.
>>>>>
>>>>> Dimitry.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 26, 2010 at 10:14 PM, Aaron Morton <aa...@thelastpickle.com> wrote:
>>>>>
>>>>>> Dimitry, did you get anywhere with this?
>>>>>>
>>>>>> Was it the .18 or the .17 node you tried to add? How did you start
>>>>>> the bootstrap for the .18 node? Does it have a token in the config, or
>>>>>> did you use nodetool move to set it?
>>>>>>
>>>>>> I had a quick look at the code; AFAIK the message about removing the
>>>>>> fat client is logged when the node does not have a record of the token
>>>>>> the other node has.
>>>>>>
>>>>>> Aaron
>>>>>>
>>>>>> On 26 Oct, 2010, at 10:42 PM, Dimitry Lvovsky <dimi...@reviewpro.com> wrote:
>>>>>>
>>>>>> Hi All,
>>>>>> We recently upgraded from 0.6.5 to 0.6.6, after which we tried adding a
>>>>>> new node to our cluster. We left it bootstrapping, and after 3 days it
>>>>>> still refused to join the ring. The strange thing is that nodetool info
>>>>>> shows 50GB of load and nodetool ring shows that it sees the rest of the
>>>>>> ring, which it is not part of. We tried the process again with another
>>>>>> server, with the same result:
>>>>>>
>>>>>>
>>>>>> //from machine 192.168.218
>>>>>>
>>>>>>
>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 info
>>>>>> 131373516047318302934572185119435768941
>>>>>> Load : 52.85 GB
>>>>>> Generation No : 1287761987
>>>>>> Uptime (seconds) : 323157
>>>>>> Heap Memory (MB) : 795.42 / 1945.63
>>>>>>
>>>>>>
>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 ring
>>>>>> Address       Status  Load      Range                                      Ring
>>>>>>                                 158573510920250391466717289405976537674
>>>>>> 192.168.2.22  Up      59.45 GB  28203205416427384773583427414698832202     |<--|
>>>>>> 192.168.2.23  Up      44.95 GB  60562227403709245514637766500430120055     |   |
>>>>>> 192.168.2.20  Up      47.15 GB  104160057322065544623939416372654814065    |   |
>>>>>> 192.168.2.21  Up      61.04 GB  158573510920250391466717289405976537674    |-->|
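No token was set in the config, so the bootstrapping node picks its own. A quick sanity check is to use the tokens in the ring output above to find which node owns the widest range and where that range's midpoint falls. Below is a minimal Python sketch. It assumes the RandomPartitioner token space of 0 to 2**127, and it assumes each node owns the range from its predecessor's token (exclusive) to its own (inclusive). It looks only at token spacing, not at the load figures nodetool reports, and both function names are hypothetical.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space


def widest_range(tokens):
    """Return (start, end, width) of the widest range on the ring.

    Each node is assumed to own (previous token, its own token]; the
    lowest token's range wraps around from the highest token through zero.
    """
    ring = sorted(tokens)
    best = None
    for i, end in enumerate(ring):
        start = ring[i - 1]  # i == 0 picks the highest token: the wrap-around range
        width = (end - start) % RING_SIZE or RING_SIZE  # a one-node ring owns everything
        if best is None or width > best[2]:
            best = (start, end, width)
    return best


def midpoint_token(start, width):
    """Token halfway through a range that begins just after `start`."""
    return (start + width // 2) % RING_SIZE
```

For the four tokens above, the widest range is the one ending at 192.168.2.21's token. The token reported by nodetool info on the new machine (131373516047318302934572185119435768941) also falls inside that range.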
>>>>>>
>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 streams
>>>>>> Mode: Bootstrapping
>>>>>> Not sending any streams.
>>>>>> Not receiving any streams.
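The `nodetool streams` output above is a quick signal: the mode is still Bootstrapping, yet nothing is being sent or received. Below is a minimal Python sketch that parses that text. It assumes the exact wording shown above, and other versions may print it differently. An idle snapshot only hints at a stall and does not prove one. Both function names are hypothetical.

```python
def parse_streams_output(text):
    """Return (mode, sending, receiving) parsed from `nodetool streams` text."""
    mode = None
    sending = receiving = True  # assume activity unless the output says otherwise
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Mode:"):
            mode = line[len("Mode:"):].strip()
        elif line == "Not sending any streams.":
            sending = False
        elif line == "Not receiving any streams.":
            receiving = False
    return mode, sending, receiving


def looks_stalled(text):
    """True when the node is bootstrapping but neither sending nor receiving."""
    mode, sending, receiving = parse_streams_output(text)
    return mode == "Bootstrapping" and not sending and not receiving
```

The input can come from `subprocess.run([...nodetool command above...], capture_output=True, text=True).stdout`. Checking it repeatedly over several minutes is more telling than a single reading.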
>>>>>>
>>>>>>
>>>>>> What's more, in the log of one of the nodes I see gossip messages about
>>>>>> 192.168.2.17, the first node we tried to add to the cluster, which was
>>>>>> not running at the time of the log messages:
>>>>>> INFO [Timer-0] 2010-10-26 02:13:20,340 Gossiper.java (line 406) FatClient /192.168.2.17 has been silent for 3600000ms, removing from gossip
>>>>>> INFO [GMFD:1] 2010-10-26 02:13:51,398 Gossiper.java (line 591) Node /192.168.2.17 is now part of the cluster
>>>>>>
>>>>>>
>>>>>> Thanks in advance for the help,
>>>>>> Dimitry
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Dimitry Lvovsky
>>>>> Director of Engineering
>>>>> ReviewPro
>>>>> www.reviewpro.com
>>>>> +34 616 337 103
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
