Could you verify any security settings that may come into play with Elastic IPs? Make sure the appropriate ports are open between the nodes.
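For example (a rough reachability check, not an official tool): a small script can confirm that TCP connections to the gossip and JMX ports succeed from one node to another. The node IP below is a placeholder; substitute one of your own instances.

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Placeholder node IP; replace with one of your instance addresses.
node = "10.68.0.1"
for port in (7000, 7199):  # gossip and JMX, per the port chart below
    print(port, "open" if port_open(node, port) else "closed/filtered")
```

Note this only proves TCP connectivity from where you run it; a security group can still allow your workstation while blocking node-to-node traffic, so run it from the nodes themselves.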
See http://www.datastax.com/docs/0.8/brisk/install_brisk_ami for a list of ports in the first chart.

Joaquin Casares
DataStax Software Engineer/Support

On Mon, Jun 20, 2011 at 7:41 PM, Sameer Farooqui <cassandral...@gmail.com> wrote:

> Quick update...
>
> I'm first trying to get a 3-node cluster working that is defined the following way in the topology.properties file:
>
> 10.68.x.x=DC1:RAC1
> 10.198.x.x=DC1:RAC2
> 10.204.x.x=DC1:RAC3
>
> I'll split the 3rd node into a separate data center later.
>
> Also, ignore the comment I made about the $BRISK_HOME/lib/ folder not existing. When you run ant, I believe it gets populated correctly, but I'll have to confirm/test later.
>
> Based on Joaquin @ DataStax's suggestion, I tried changing the seed IP in all 3 nodes' YAML files to the Amazon private IP instead of the Elastic IP. After this change, all three nodes joined the ring correctly:
>
> ubuntu@ip-10-68-x-x:~/brisk-1.0~beta1.2/resources/cassandra/conf$ ../bin/nodetool -h localhost ring
> Address     Status  State   Load      Owns    Token
>                                               113427455640312821154458202477256070485
> 10.68.x.x   Up      Normal  10.9 KB   33.33%  0
> 10.198.x.x  Up      Normal  15.21 KB  33.33%  56713727820156410577229101238628035242
> 10.204.x.x  Up      Normal  6.55 KB   33.33%  113427455640312821154458202477256070485
>
> PasteBin is down and is showing me a diligent cat typing on a keyboard, so I uploaded some relevant DEBUG-level log files here:
>
> http://blueplastic.com/accenture/N1-system-seed_is_ElasticIP.log (problem exists)
> http://blueplastic.com/accenture/N2-system-seed_is_ElasticIP.log (problem exists)
> http://blueplastic.com/accenture/N1-system-seed_is_privateIP.log (everything works)
> http://blueplastic.com/accenture/N2-system-seed_is_privateIP.log (everything works)
>
> But if I want to set up the Brisk cluster across Amazon regions, I have to be able to use the Elastic IP for the seed.
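(Aside, assuming standard EC2 DNS behavior: an instance's public DNS name resolves to its private IP when looked up from inside the same EC2 region, and to the public/Elastic IP from outside it, which is why a name-based seed is sometimes used as a workaround for exactly this situation. A minimal sketch of checking what a name resolves to from a given host; the EC2 hostname shown in the comment is a hypothetical placeholder:)

```python
import socket

def resolve(hostname):
    """Return the IPv4 address this host resolves hostname to."""
    return socket.gethostbyname(hostname)

# Substitute a real EC2 public DNS name, e.g.
# resolve("ec2-50-17-x-x.compute-1.amazonaws.com")
# Run from inside the region it should return the private (10.x) address;
# run from outside it should return the public (Elastic) address.
print(resolve("localhost"))
```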
> Also, using v0.7.4 of Cassandra in Amazon, we successfully set up a 30+ node cluster using 3 seed nodes that were declared in the YAML file with Elastic IPs. All 30 nodes were in the same region and availability zone. So, in an older version of Cassandra, providing the seeds as Elastic IPs used to work.
>
> In my current setup, even though nodes 1 & 2 are in the same region & availability zone, I can't get them to join the same ring.
>
> Here is what the system log shows when I declare the seed using the Elastic IP:
>
> INFO [Thread-4] 2011-06-21 00:10:30,849 BriskDaemon.java (line 187) Listening for thrift clients...
> DEBUG [GossipTasks:1] 2011-06-21 00:10:31,608 Gossiper.java (line 201) Assuming current protocol version for /50.17.x.x
> DEBUG [WRITE-/50.17.212.84] 2011-06-21 00:10:31,610 OutboundTcpConnection.java (line 161) attempting to connect to /50.17.x.x
> DEBUG [GossipTasks:1] 2011-06-21 00:10:32,610 Gossiper.java (line 201) Assuming current protocol version for /50.17.x.x
> DEBUG [ScheduledTasks:1] 2011-06-21 00:10:32,613 StorageLoadBalancer.java (line 334) Disseminating load info ...
> DEBUG [GossipTasks:1] 2011-06-21 00:10:33,611 Gossiper.java (line 201) Assuming current protocol version for /50.17.x.x
> DEBUG [GossipTasks:1] 2011-06-21 00:10:34,612 Gossiper.java (line 201) Assuming current protocol version for /50.17.x.x
>
> But when I use the private IP, the log shows:
>
> INFO [Thread-4] 2011-06-21 00:19:47,993 BriskDaemon.java (line 187) Listening for thrift clients...
> DEBUG [ScheduledTasks:1] 2011-06-21 00:19:49,769 StorageLoadBalancer.java (line 334) Disseminating load info ...
> DEBUG [WRITE-/10.198.126.193] 2011-06-21 00:20:09,658 OutboundTcpConnection.java (line 161) attempting to connect to /10.198.x.x
> INFO [GossipStage:1] 2011-06-21 00:20:09,690 Gossiper.java (line 637) Node /10.198.x.x is now part of the cluster
> DEBUG [GossipStage:1] 2011-06-21 00:20:09,691 MessagingService.java (line 158) Resetting pool for /10.198.x.x
> INFO [GossipStage:1] 2011-06-21 00:20:09,691 Gossiper.java (line 605) InetAddress /10.198.x.x is now UP
> DEBUG [HintedHandoff:1] 2011-06-21 00:20:09,692 HintedHandOffManager.java (line 282) Checking remote schema before delivering hints
> DEBUG [HintedHandoff:1] 2011-06-21 00:20:09,692 HintedHandOffManager.java (line 274) schema for /10.198.x.x matches local schema
> DEBUG [HintedHandoff:1] 2011-06-21 00:20:09,692 HintedHandOffManager.java (line 288) Sleeping 11662ms to stagger hint delivery
>
> - Sameer
>
> On Mon, Jun 20, 2011 at 2:28 PM, Sameer Farooqui <cassandral...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm setting up a 3-node test cluster in multiple Amazon Availability Zones to test cross-zone internode communication (and eventually cross-region communication).
>>
>> But I wanted to start with a cross-zone setup and am having trouble getting the nodes to connect to each other and join one 3-node ring. Each node just joins its own ring and claims 100% of that space.
>>
>> I'm using this beta2 distribution of Brisk:
>> http://debian.datastax.com/maverick/pool/brisk_1.0~beta1.2.tar.gz
>>
>> I had to manually recreate the $BRISK_HOME/lib/ folder because it didn't exist in the binary for some reason, and I also added the jna and mx4j jar files to the lib directory.
>> The cluster is geographically located like this:
>>
>> Node 1 (seed): East-A
>> Node 2: East-A
>> Node 3: East-B
>>
>> The cassandra-topology.properties file on all three nodes contains this:
>>
>> # Cassandra Node IP=Data Center:Rack
>> 10.68.x.x=DC1:RAC1
>> 10.198.x.x=DC1:RAC2
>> 10.204.x.x=DC2:RAC1
>> default=DC1:RAC1
>>
>> And finally, here is what the relevant section of the YAML file looks like for each node:
>>
>> ++ Node 1 ++
>> cluster_name: 'Test Cluster'
>> initial_token: 0
>> auto_bootstrap: false
>> partitioner: org.apache.cassandra.dht.RandomPartitioner
>> - seeds: 50.17.x.x  # This is the Elastic IP for Node 1
>> listen_address: 10.68.x.x
>> rpc_address: 0.0.0.0
>> endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
>> encryption_options:
>>     internode_encryption: none
>>
>> ++ Node 2 ++
>> cluster_name: 'Test Cluster'
>> initial_token: 56713727820156410577229101238628035242
>> auto_bootstrap: true
>> partitioner: org.apache.cassandra.dht.RandomPartitioner
>> - seeds: 50.17.x.x  # This is the Elastic IP for Node 1
>> listen_address: 10.198.x.x
>> rpc_address: 0.0.0.0
>> endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
>> encryption_options:
>>     internode_encryption: none
>>
>> ++ Node 3 ++
>> cluster_name: 'Test Cluster'
>> initial_token: 113427455640312821154458202477256070485
>> auto_bootstrap: true
>> partitioner: org.apache.cassandra.dht.RandomPartitioner
>> - seeds: 50.17.x.x  # This is the Elastic IP for Node 1
>> listen_address: 10.204.x.x
>> rpc_address: 0.0.0.0
>> endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
>> encryption_options:
>>     internode_encryption: none
>>
>> When I start Cassandra on all three nodes using "sudo bin/brisk cassandra", the startup log doesn't show any warnings or errors.
>> The end of the start log on Node 1 says:
>>
>> INFO [main] 2011-06-20 21:06:57,702 MessagingService.java (line 201) Starting Messaging Service on port 7000
>> INFO [main] 2011-06-20 21:06:57,723 StorageService.java (line 482) Using saved token 0
>> INFO [main] 2011-06-20 21:06:57,724 ColumnFamilyStore.java (line 1011) Enqueuing flush of Memtable-LocationInfo@1260987126(38/47 serialized/live bytes, 2 ops)
>> INFO [FlushWriter:1] 2011-06-20 21:06:57,724 Memtable.java (line 237) Writing Memtable-LocationInfo@1260987126(38/47 serialized/live bytes, 2 ops)
>> INFO [FlushWriter:1] 2011-06-20 21:06:57,809 Memtable.java (line 254) Completed flushing /raiddrive/data/system/LocationInfo-g-12-Data.db (148 bytes)
>> INFO [CompactionExecutor:2] 2011-06-20 21:06:57,812 CompactionManager.java (line 539) Compacting Major: [SSTableReader(path='/raiddrive/data/system/LocationInfo-g-9-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-11-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-10-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-12-Data.db')]
>> INFO [CompactionExecutor:2] 2011-06-20 21:06:57,828 CompactionIterator.java (line 186) Major@1110828771(system, LocationInfo, 429/808) now compacting at 16777 bytes/ms.
>> INFO [main] 2011-06-20 21:06:57,881 Mx4jTool.java (line 67) mx4j successfuly loaded
>> INFO [CompactionExecutor:2] 2011-06-20 21:06:57,909 CompactionManager.java (line 603) Compacted to /raiddrive/data/system/LocationInfo-tmp-g-13-Data.db. 808 to 432 (~53% of original) bytes for 3 keys. Time: 97ms.
>> INFO [main] 2011-06-20 21:06:57,953 BriskDaemon.java (line 146) Binding thrift service to /0.0.0.0:9160
>> INFO [main] 2011-06-20 21:06:57,955 BriskDaemon.java (line 160) Using TFastFramedTransport with a max frame size of 15728640 bytes.
>> INFO [Thread-4] 2011-06-20 21:06:57,958 BriskDaemon.java (line 187) Listening for thrift clients...
>> And the end of the log on Node 2 says:
>>
>> INFO [main] 2011-06-20 21:06:57,899 StorageService.java (line 368) Cassandra version: 0.8.0-beta2-SNAPSHOT
>> INFO [main] 2011-06-20 21:06:57,901 StorageService.java (line 369) Thrift API version: 19.10.0
>> INFO [main] 2011-06-20 21:06:57,901 StorageService.java (line 382) Loading persisted ring state
>> INFO [main] 2011-06-20 21:06:57,904 StorageService.java (line 418) Starting up server gossip
>> INFO [main] 2011-06-20 21:06:57,915 ColumnFamilyStore.java (line 1011) Enqueuing flush of Memtable-LocationInfo@885597447(29/36 serialized/live bytes, 1 ops)
>> INFO [FlushWriter:1] 2011-06-20 21:06:57,916 Memtable.java (line 237) Writing Memtable-LocationInfo@885597447(29/36 serialized/live bytes, 1 ops)
>> INFO [FlushWriter:1] 2011-06-20 21:06:57,990 Memtable.java (line 254) Completed flushing /raiddrive/data/system/LocationInfo-g-8-Data.db (80 bytes)
>> INFO [CompactionExecutor:1] 2011-06-20 21:06:58,000 CompactionManager.java (line 539) Compacting Major: [SSTableReader(path='/raiddrive/data/system/LocationInfo-g-6-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-8-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-7-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-5-Data.db')]
>> INFO [main] 2011-06-20 21:06:58,007 MessagingService.java (line 201) Starting Messaging Service on port 7000
>> INFO [CompactionExecutor:1] 2011-06-20 21:06:58,015 CompactionIterator.java (line 186) Major@291813814(system, LocationInfo, 467/770) now compacting at 16777 bytes/ms.
>> INFO [main] 2011-06-20 21:06:58,032 StorageService.java (line 482) Using saved token 56713727820156410577229101238628035242
>> INFO [main] 2011-06-20 21:06:58,033 ColumnFamilyStore.java (line 1011) Enqueuing flush of Memtable-LocationInfo@934909150(53/66 serialized/live bytes, 2 ops)
>> INFO [FlushWriter:1] 2011-06-20 21:06:58,033 Memtable.java (line 237) Writing Memtable-LocationInfo@934909150(53/66 serialized/live bytes, 2 ops)
>> INFO [FlushWriter:1] 2011-06-20 21:06:58,157 Memtable.java (line 254) Completed flushing /raiddrive/data/system/LocationInfo-g-10-Data.db (163 bytes)
>> INFO [CompactionExecutor:1] 2011-06-20 21:06:58,169 CompactionManager.java (line 603) Compacted to /raiddrive/data/system/LocationInfo-tmp-g-9-Data.db. 770 to 447 (~58% of original) bytes for 3 keys. Time: 168ms.
>> INFO [main] 2011-06-20 21:06:58,206 Mx4jTool.java (line 67) mx4j successfuly loaded
>> INFO [main] 2011-06-20 21:06:58,249 BriskDaemon.java (line 146) Binding thrift service to /0.0.0.0:9160
>> INFO [main] 2011-06-20 21:06:58,252 BriskDaemon.java (line 160) Using TFastFramedTransport with a max frame size of 15728640 bytes.
>> INFO [Thread-4] 2011-06-20 21:06:58,254 BriskDaemon.java (line 187) Listening for thrift clients...
>> Running nodetool ring on Node 1 shows:
>>
>> ubuntu@ip-10-68-x-x:~/brisk-1.0~beta1.2/resources/cassandra$ bin/nodetool -h localhost ring
>> Address     Status  State   Load     Owns     Token
>> 10.68.x.x   Up      Normal  10.9 KB  100.00%  0
>>
>> And nodetool ring on Node 2 shows:
>>
>> ubuntu@domU-12-31-39-10-x-x:~/brisk-1.0~beta1.2/resources/cassandra$ bin/nodetool -h localhost ring
>> Address     Status  State   Load      Owns     Token
>> 10.198.x.x  Up      Normal  15.21 KB  100.00%  56713727820156410577229101238628035242
>>
>> I have also tried placing all three nodes in the same data center, like this, with no luck:
>>
>> 10.68.x.x=DC1:RAC1
>> 10.198.x.x=DC1:RAC2
>> 10.204.x.x=DC1:RAC3
>>
>> After the above change, all nodes still join their own ring and claim 100% of it. Here are the full startup logs for when just one data center is specified in the topology.properties file:
>>
>> Node 1: http://pastebin.com/Vzy2u9WB
>> Node 2: http://pastebin.com/rqGy5Asy
>>
>> On a side note, I have also tried switching the snitch in the YAML file on all three nodes to BriskSimpleSnitch. The nodes still don't join the same ring and the same symptoms are exhibited, so I'm guessing the problem is not the snitch but something else.
>>
>> I can ping all three nodes from each other, and the following ports are open between the nodes: ICMP, TCP 1024-65535, 7000, 7199, 8012, 8888.
>>
>> Questions:
>>
>> 1) What am I doing wrong that's preventing the nodes from seeing each other and joining one ring? What should I look at more closely to troubleshoot this?
>>
>> 2) Would it help to troubleshoot this if I turn on DEBUG logging for Cassandra and then restart the "bin/brisk cassandra" service?
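(Note for anyone landing on this thread: in this Cassandra/Brisk generation, log verbosity is controlled by conf/log4j-server.properties. A minimal sketch of enabling DEBUG output, assuming the stock file layout, is to raise the root logger level and then restart the daemon; the appender names stdout and R are the defaults shipped with the distribution:)

```
# conf/log4j-server.properties
# Stock line is "log4j.rootLogger=INFO,stdout,R"; raise it to DEBUG:
log4j.rootLogger=DEBUG,stdout,R
```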