In a 5-node cluster, I noticed in our client error log that one of the
nodes was consistently throwing cassandra_UnavailableException during
read operations.

Looking into JMX, it was obvious that one node's view of the
ring was out of sync.

$ nodetool -host 192.168.20.150 ring
Address         Status     Load      Range                                     Ring
                                     139508497374977076191526400448759597506
192.168.20.156  Up         5.73 GB   733665530305941485083898696792520436      |<--|
192.168.20.158  Up         3.41 GB   9629533262984150011756238989685472219     |   ^
192.168.20.154  Up         2.44 GB   31048334058970902242412812423471654868    v   |
192.168.20.150  Up         4.89 GB   105769574715070648260922426249777160699   |   ^
192.168.20.152  Up         5.24 GB   139508497374977076191526400448759597506   |-->|

$ nodetool -host 192.168.20.158 ring
Address         Status     Load      Range                                     Ring
192.168.20.158  Up         3.41 GB   9629533262984150011756238989685472219     |<--|

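(As a quick sanity check, something along these lines shows at a glance how many
ring entries each node reports; the addresses are just the five nodes from the
ring output above, and the grep pattern assumes the address column starts each row.)

for h in 192.168.20.150 192.168.20.152 192.168.20.154 192.168.20.156 192.168.20.158; do
  echo "== $h =="
  # count the ring entries this node reports (a node with a full view should see all 5)
  nodetool -host $h ring | grep -c '^192\.168\.20\.'
done
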
Looking at the CF stats on that node, it is obvious that reads and
writes are happening, but I have to assume those are arriving as
proxied requests via the other nodes.
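
(For reference, those per-CF read/write counts are just from nodetool's cfstats,
roughly:)

$ nodetool -host 192.168.20.158 cfstats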

When restarting that node, the logs on the other cluster nodes
show that they detect the server going away and then coming back into
the ring:

INFO [WRITE-/192.168.20.158] 2010-05-19 21:27:39,448 OutboundTcpConnection.java (line 102) error writing to /192.168.20.158
INFO [WRITE-/192.168.20.158] 2010-05-19 21:27:55,475 OutboundTcpConnection.java (line 102) error writing to /192.168.20.158
INFO [GMFD:1] 2010-05-19 21:27:56,481 Gossiper.java (line 582) Node /192.168.20.158 has restarted, now UP again
INFO [GMFD:1] 2010-05-19 21:27:56,482 StorageService.java (line 538) Node /192.168.20.158 state jump to normal
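
(Those lines came from a plain grep on the other nodes' logs; the path below is
just where our install writes them, so adjust to taste:)

$ grep '192\.168\.20\.158' /var/log/cassandra/system.log   # log path assumed from our setup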

Any ideas on how to kick that node and remind it of its buddies?

Thanks!
-keith
