It seems that we did not have the JMX ports (1024+) open in our firewall. Once we opened ports 1024+, the hinted handoffs completed and the cluster appears to have gone back to normal. Does that make sense?
Thanks,
Dan

This is what we saw in the logs after opening the ports:

INFO [HintedHandoff:1] 2013-05-05 14:52:41,925 ColumnFamilyStore.java (line 659) Enqueuing flush of Memtable-HintsColumnFamily@726541064(33313153/41641441 serialized/live bytes, 18009 ops)
INFO [FlushWriter:4] 2013-05-05 14:52:41,926 Memtable.java (line 264) Writing Memtable-HintsColumnFamily@726541064(33313153/41641441 serialized/live bytes, 18009 ops)
INFO [FlushWriter:4] 2013-05-05 14:52:42,961 Memtable.java (line 305) Completed flushing /data/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-he-10-Data.db (33344642 bytes) for commitlog position ReplayPosition(segmentId=1367725930067, position=12449833)
INFO [CompactionExecutor:16] 2013-05-05 14:52:42,969 CompactionTask.java (line 109) Compacting [SSTableReader(path='/data/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-he-10-Data.db'), SSTableReader(path='/data/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-he-9-Data.db')]
INFO [HintedHandoff:1] 2013-05-05 14:52:43,419 HintedHandOffManager.java (line 390) Finished hinted handoff of 7945 rows to endpoint /107.20.45.6

-----Original Message-----
From: Dan Kogan [mailto:d...@iqtell.com]
Sent: Sunday, May 05, 2013 8:24 AM
To: user@cassandra.apache.org
Subject: Node went down and came back up

Hello,

Last night one of our nodes froze and the server had to be rebooted. After it came up, the node joined the ring and everything looked normal. However, this morning there seem to be some inconsistencies in the data (e.g. some nodes don't have a given record, or have a different version of the record than other nodes).

There are also a lot of messages about hinted handoff in the logs that started after the node failure.
Like these:

INFO [HintedHandoff:1] 2013-05-05 11:22:23,339 HintedHandOffManager.java (line 294) Started hinted handoff for token: 56713727820156410577229101238628035242 with IP: /107.20.45.6
INFO [HintedHandoff:1] 2013-05-05 11:22:33,343 HintedHandOffManager.java (line 372) Timed out replaying hints to /107.20.45.6; aborting further deliveries
INFO [HintedHandoff:1] 2013-05-05 11:22:33,344 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /107.20.45.6
INFO [HintedHandoff:1] 2013-05-05 11:22:33,344 HintedHandOffManager.java (line 294) Started hinted handoff for token: 0 with IP: /67.202.15.178
INFO [HintedHandoff:1] 2013-05-05 11:22:43,348 HintedHandOffManager.java (line 372) Timed out replaying hints to /67.202.15.178; aborting further deliveries
INFO [HintedHandoff:1] 2013-05-05 11:22:43,348 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /67.202.15.178

Do we need to run repair on all nodes to get the cluster back to "normal" state?

Thanks for the help.
Dan Kogan
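
For anyone comparing the "before" and "after" log output, here is a rough Python sketch that tallies hinted handoff results per endpoint. It is only an illustration: the function name `summarize_handoffs` and the `SAMPLE` list are hypothetical, and the regular expressions assume the exact message wording seen in the log lines quoted in this thread, which may differ in other Cassandra versions.

```python
import re
from collections import defaultdict

# Assumption: message wording matches the HintedHandOffManager lines quoted in this thread.
FINISHED = re.compile(r"Finished hinted handoff of (\d+) rows to endpoint /(\S+)")
TIMED_OUT = re.compile(r"Timed out replaying hints to /([^;\s]+)")


def summarize_handoffs(lines):
    """Return {endpoint: {"rows": delivered_rows, "timeouts": timeout_count}}."""
    summary = defaultdict(lambda: {"rows": 0, "timeouts": 0})
    for line in lines:
        finished = FINISHED.search(line)
        if finished:
            summary[finished.group(2)]["rows"] += int(finished.group(1))
            continue
        timed_out = TIMED_OUT.search(line)
        if timed_out:
            summary[timed_out.group(1)]["timeouts"] += 1
    return dict(summary)


# Hypothetical sample built from log lines quoted in this thread.
SAMPLE = [
    "INFO [HintedHandoff:1] 2013-05-05 11:22:33,343 HintedHandOffManager.java (line 372) Timed out replaying hints to /107.20.45.6; aborting further deliveries",
    "INFO [HintedHandoff:1] 2013-05-05 11:22:33,344 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /107.20.45.6",
    "INFO [HintedHandoff:1] 2013-05-05 11:22:43,348 HintedHandOffManager.java (line 372) Timed out replaying hints to /67.202.15.178; aborting further deliveries",
    "INFO [HintedHandoff:1] 2013-05-05 11:22:43,348 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /67.202.15.178",
    "INFO [HintedHandoff:1] 2013-05-05 14:52:43,419 HintedHandOffManager.java (line 390) Finished hinted handoff of 7945 rows to endpoint /107.20.45.6",
]

if __name__ == "__main__":
    for endpoint, stats in sorted(summarize_handoffs(SAMPLE).items()):
        print(endpoint, stats)
```

An endpoint that shows timeouts and zero delivered rows, as both endpoints did before the ports were opened, is one that still has hints waiting to be replayed.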