I'm running into a quirky issue with Brisk 1.0 Beta 2 (w/ Cassandra 0.8.1).
I think the last node in our cluster is having problems (10.201.x.x). OpsCenter and nodetool ring (run from that node) show the node as down, but the rest of the cluster sees it as up.

If I run nodetool ring from one of the first 11 nodes, everything shows as up:

ubuntu@ip-10-85-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost ring
Address         DC    Rack  Status  State   Load       Owns    Token
                                                                148873535527910577765226390751398592512
10.2.x.x        DC1   RAC1  Up      Normal  901.57 GB  12.50%  0
10.116.x.x      DC2   RAC1  Up      Normal  258.22 GB  6.25%   10633823966279326983230456482242756608
10.110.x.x      DC1   RAC1  Up      Normal  129.07 GB  6.25%   21267647932558653966460912964485513216
10.2.x.x        DC1   RAC1  Up      Normal  128.5 GB   12.50%  42535295865117307932921825928971026432
10.114.x.x      DC2   RAC1  Up      Normal  257.31 GB  6.25%   53169119831396634916152282411213783040
10.210.x.x      DC1   RAC1  Up      Normal  128.66 GB  6.25%   63802943797675961899382738893456539648
10.207.x.x      DC1   RAC2  Up      Normal  643.12 GB  12.50%  85070591730234615865843651857942052864
10.85.x.x       DC2   RAC1  Up      Normal  256.76 GB  6.25%   95704415696513942849074108340184809472
10.2.x.x        DC1   RAC2  Up      Normal  128.95 GB  6.25%   106338239662793269832304564822427566080
10.96.x.x       DC1   RAC2  Up      Normal  128.29 GB  12.50%  127605887595351923798765477786913079296
10.194.x.x      DC2   RAC1  Up      Normal  257.14 GB  6.25%   138239711561631250781995934269155835904
10.201.x.x      DC1   RAC2  Up      Normal  129.45 GB  6.25%   148873535527910577765226390751398592512

However, OpsCenter shows the last node (10.201.x.x) as unresponsive:
http://blueplastic.com/accenture/unresponsive.PNG

And if I try to run nodetool ring from the 10.201.x.x node, I get connection errors like this:

ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost ring
Error connection to remote JMX agent!
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
        java.net.SocketTimeoutException: Read timed out]
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:338)
        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
        at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:141)
        at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:111)
        at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:559)
Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
        java.net.SocketTimeoutException: Read timed out]
        at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)

The tpstats command also didn't work:

ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost tpstats
Error connection to remote JMX agent!
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
        java.net.SocketTimeoutException: Read timed out]
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:338)

Looking at what's listening on port 7199 on that node shows a bunch of connections stuck in CLOSE_WAIT:

ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ sudo netstat -anp | grep 7199
tcp        0      0 0.0.0.0:7199        0.0.0.0:*           LISTEN      1459/java
tcp        8      0 10.194.x.x:7199     10.2.x.x:40135      CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:49835     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:55087     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:49837     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:55647     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:49833     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:52935     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:52940     CLOSE_WAIT  -
tcp        8      0 10.194.x.x:7199     10.2.x.x:40141      CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:52936     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:55646     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:39098     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:39095     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:55086     CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199      127.0.0.1:50575     CLOSE_WAIT  -
[list truncated, there are about 20 more lines]

The /var/log/cassandra dir shows that none of the system.log files have been touched in the last two days. The tail of system.log.1 shows:

FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,120 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,121 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,123 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,124 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)

Also, this node might be suffering a memory leak or a spinning thread. Check out the top output from it (specifically the CPU and MEM columns):
http://blueplastic.com/accenture/top.PNG

Anything else I can do to troubleshoot this? Is this a known issue that I can just ignore and reboot the node?

- Sameer
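P.S. Since system.log.1 is complaining about "Too many open files", here is roughly what I plan to run on the 10.201.x.x node before restarting it, to see whether the Cassandra JVM has exhausted its file-descriptor limit. The PID 1459 is just the java process netstat reported above (substitute whatever pgrep returns on your box); these are all standard Linux commands, nothing Brisk-specific, so treat this as a sketch rather than a recipe:

# Find the Cassandra/Brisk JVM PID (1459 in the netstat output above)
pgrep -f CassandraDaemon

# Soft/hard "open files" limits the process actually started with
cat /proc/1459/limits | grep "open files"

# Number of file descriptors the process currently has open
sudo ls /proc/1459/fd | wc -l

# How many JMX connections on 7199 are stuck in CLOSE_WAIT
sudo netstat -anp | grep 7199 | grep -c CLOSE_WAIT

# Shell-level limit for the user that starts the daemon
ulimit -n

If the open-descriptor count is close to the limit (the Ubuntu default soft limit is 1024), that would at least explain both the stuck JMX connections and the core-site.xml "Too many open files" errors, and raising nofile for the ubuntu user in /etc/security/limits.conf before restarting the node would be one thing to try.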