In a 5-node cluster, I noticed in our client error log that one of the nodes was consistently throwing cassandra_UnavailableException on reads.
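To show what I mean by the failing read, it boils down to something like the following raw Thrift get (just a sketch, not our actual client code; the keyspace, column family, column, and consistency level are placeholders):

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.UnavailableException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ReadCheck {
    public static void main(String[] args) throws Exception {
        // talk directly to the node that keeps failing
        TTransport tr = new TSocket("192.168.20.158", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
        tr.open();
        try {
            ColumnPath path = new ColumnPath("Standard1");   // placeholder CF
            path.setColumn("name".getBytes("UTF-8"));        // placeholder column
            client.get("Keyspace1", "some-row-key", path, ConsistencyLevel.QUORUM);
            System.out.println("read ok");
        } catch (UnavailableException ue) {
            // this is what shows up in our client error log: the node thinks it
            // cannot reach enough replicas, which fits it only knowing about itself
            System.err.println("UnavailableException from 192.168.20.158");
        } finally {
            tr.close();
        }
    }
}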
Looking into JMX (via nodetool), it was obvious that one node's view of the ring was out of sync:

$ nodetool -host 192.168.20.150 ring
Address         Status  Load      Range                                       Ring
                                  139508497374977076191526400448759597506
192.168.20.156  Up      5.73 GB   733665530305941485083898696792520436       |<--|
192.168.20.158  Up      3.41 GB   9629533262984150011756238989685472219      |   ^
192.168.20.154  Up      2.44 GB   31048334058970902242412812423471654868     v   |
192.168.20.150  Up      4.89 GB   105769574715070648260922426249777160699    |   ^
192.168.20.152  Up      5.24 GB   139508497374977076191526400448759597506    |-->|

$ nodetool -host 192.168.20.158 ring
Address         Status  Load      Range                                       Ring
192.168.20.158  Up      3.41 GB   9629533262984150011756238989685472219      |<--|

Looking at the CF stats on that node, reads and writes are clearly happening, but I have to assume those are arriving as proxied requests from the other nodes. When I restart that node, the logs on the other cluster nodes show them detecting it going away and then coming back into the ring:

 INFO [WRITE-/192.168.20.158] 2010-05-19 21:27:39,448 OutboundTcpConnection.java (line 102) error writing to /192.168.20.158
 INFO [WRITE-/192.168.20.158] 2010-05-19 21:27:55,475 OutboundTcpConnection.java (line 102) error writing to /192.168.20.158
 INFO [GMFD:1] 2010-05-19 21:27:56,481 Gossiper.java (line 582) Node /192.168.20.158 has restarted, now UP again
 INFO [GMFD:1] 2010-05-19 21:27:56,482 StorageService.java (line 538) Node /192.168.20.158 state jump to normal

Any ideas on how to kick that node and remind it of its buddies?

Thanks!

-keith
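P.S. If it's easier than eyeballing nodetool output on each host, each node's own view can also be pulled straight over JMX. A quick sketch of what I mean (the StorageService MBean name, the LiveNodes attribute, and JMX port 8080 are written from memory, so they may need adjusting for your setup):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RingViewCheck {
    public static void main(String[] args) throws Exception {
        String[] hosts = { "192.168.20.150", "192.168.20.152", "192.168.20.154",
                           "192.168.20.156", "192.168.20.158" };
        for (String host : hosts) {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                ObjectName ss = new ObjectName(
                    "org.apache.cassandra.service:type=StorageService");
                // each node's own idea of which peers are alive;
                // the out-of-sync node should only list itself
                Object live = mbs.getAttribute(ss, "LiveNodes");
                System.out.println(host + " sees live: " + live);
            } finally {
                jmxc.close();
            }
        }
    }
}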