It turns out that the JMX ports (1024+) were not open in our firewall.
Once we opened ports 1024 and above, the hinted handoffs completed and the
cluster appears to have gone back to normal.
Does that make sense?

Thanks,
Dan

This is what we saw in the logs after opening the ports:

INFO [HintedHandoff:1] 2013-05-05 14:52:41,925 ColumnFamilyStore.java (line 659) Enqueuing flush of Memtable-HintsColumnFamily@726541064(33313153/41641441 serialized/live bytes, 18009 ops)
INFO [FlushWriter:4] 2013-05-05 14:52:41,926 Memtable.java (line 264) Writing Memtable-HintsColumnFamily@726541064(33313153/41641441 serialized/live bytes, 18009 ops)
INFO [FlushWriter:4] 2013-05-05 14:52:42,961 Memtable.java (line 305) Completed flushing /data/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-he-10-Data.db (33344642 bytes) for commitlog position ReplayPosition(segmentId=1367725930067, position=12449833)
INFO [CompactionExecutor:16] 2013-05-05 14:52:42,969 CompactionTask.java (line 109) Compacting [SSTableReader(path='/data/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-he-10-Data.db'), SSTableReader(path='/data/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-he-9-Data.db')]
INFO [HintedHandoff:1] 2013-05-05 14:52:43,419 HintedHandOffManager.java (line 390) Finished hinted handoff of 7945 rows to endpoint /107.20.45.6
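
One way to confirm that a firewall change like the one described at the top of this message took effect is to test TCP reachability between nodes. The following is a minimal sketch, not part of the original report. The helper names `is_port_open` and `check_ports` are hypothetical, and the ports in the commented usage (7000 for inter-node storage traffic, 7199 for JMX) are Cassandra's defaults, which this cluster may or may not use.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(hosts, ports, timeout=3.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {
        (host, port): is_port_open(host, port, timeout)
        for host in hosts
        for port in ports
    }


# Example usage (assumes default Cassandra ports; adjust to your setup):
# results = check_ports(["107.20.45.6", "67.202.15.178"], [7000, 7199])
# for (host, port), ok in sorted(results.items()):
#     print(host, port, "open" if ok else "blocked or closed")
```

A successful connection only shows that the port is reachable from the machine running the check, so it would need to be run from each node toward the others.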


-----Original Message-----
From: Dan Kogan [mailto:d...@iqtell.com] 
Sent: Sunday, May 05, 2013 8:24 AM
To: user@cassandra.apache.org
Subject: Node went down and came back up

Hello,

Last night one of our nodes froze and the server had to be rebooted.  After it 
came up, the node joined the ring and everything looked normal.
However, this morning there seem to be some inconsistencies in the data (e.g. 
some nodes don't have a given record, or have a different version of the record 
than the other nodes).

There are also a lot of messages about hinted handoff in the logs that started 
after the node failure.
Like these:

INFO [HintedHandoff:1] 2013-05-05 11:22:23,339 HintedHandOffManager.java (line 294) Started hinted handoff for token: 56713727820156410577229101238628035242 with IP: /107.20.45.6
INFO [HintedHandoff:1] 2013-05-05 11:22:33,343 HintedHandOffManager.java (line 372) Timed out replaying hints to /107.20.45.6; aborting further deliveries
INFO [HintedHandoff:1] 2013-05-05 11:22:33,344 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /107.20.45.6
INFO [HintedHandoff:1] 2013-05-05 11:22:33,344 HintedHandOffManager.java (line 294) Started hinted handoff for token: 0 with IP: /67.202.15.178
INFO [HintedHandoff:1] 2013-05-05 11:22:43,348 HintedHandOffManager.java (line 372) Timed out replaying hints to /67.202.15.178; aborting further deliveries
INFO [HintedHandoff:1] 2013-05-05 11:22:43,348 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /67.202.15.178
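
When there are many such entries, it can help to tally them per endpoint. Below is a rough sketch that counts the "Timed out replaying hints" messages and sums the rows reported by "Finished hinted handoff" messages. It assumes the message wording matches the entries quoted in this thread (other Cassandra versions may word them differently), and `summarize_handoffs` is a hypothetical helper name. Whitespace is matched loosely so that entries wrapped by a mail client are still recognized.

```python
import re
from collections import defaultdict

# Patterns follow the message text quoted in this thread; \s+ tolerates
# line wraps introduced by mail clients.
FINISHED = re.compile(
    r"Finished\s+hinted\s+handoff\s+of\s+(\d+)\s+rows\s+to\s+endpoint\s+"
    r"/(\d+(?:\.\d+){3})"
)
TIMED_OUT = re.compile(
    r"Timed\s+out\s+replaying\s+hints\s+to\s+/(\d+(?:\.\d+){3})"
)


def summarize_handoffs(log_text):
    """Return {endpoint: {"rows": delivered_rows, "timeouts": timeout_count}}."""
    summary = defaultdict(lambda: {"rows": 0, "timeouts": 0})
    for match in FINISHED.finditer(log_text):
        summary[match.group(2)]["rows"] += int(match.group(1))
    for match in TIMED_OUT.finditer(log_text):
        summary[match.group(1)]["timeouts"] += 1
    return dict(summary)


# Example usage (the log path is an assumption; use your own):
# with open("/var/log/cassandra/system.log") as f:
#     for endpoint, stats in summarize_handoffs(f.read()).items():
#         print(endpoint, stats)
```

An endpoint that shows timeouts and zero delivered rows, as in the entries above, is one whose hints are still waiting to be replayed.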

Do we need to run repair on all nodes to get the cluster back to a "normal" state?

Thanks for the help.

Dan Kogan
