We have a 4-node 0.7.6 cluster with RF=2 and 3 TB of data per node.
A read repair was kicked off on node 4 last week and is still in progress.
A few days ago I also kicked off a read repair on node 2.
Reads, writes and updates (no deletes) continued while the repair was in
progress, but no data has been written for the past 3-4 days.
I was hoping the repair would finish in that time frame before we proceed
with further writes/deletes.

Would it be safe to stop it and kick it off per column family, or do a full
scan of all keys as suggested in an earlier discussion? Any other suggestions
on hastening this repair?

On both nodes the repair thread has been waiting at this stage for a long
time (~60+ hours):
 java.lang.Thread.State: WAITING
at java.lang.Object.wait(Native Method)
- waiting on <580857f3> (a org.apache.cassandra.utils.SimpleCondition)
at java.lang.Object.wait(Object.java:485)
at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:38)
at org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:791)
   Locked ownable synchronizers:
- None
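
As far as I can tell from the trace, the repair thread is not doing any work
itself; it is parked until something else signals the condition. Here is a
minimal standalone sketch of that pattern. It is not Cassandra code: the
class and method names are made up for illustration, and a CountDownLatch
stands in for SimpleCondition.

```java
import java.util.concurrent.CountDownLatch;

public class WaitingThreadSketch {
    // Starts a thread that blocks on a latch (a hypothetical stand-in for the
    // condition the repair session awaits) and reports the state a thread
    // dump would show for it while nothing has signalled yet.
    static Thread.State stateWhileBlocked() {
        CountDownLatch done = new CountDownLatch(1);
        Thread session = new Thread(() -> {
            try {
                done.await(); // blocks until another thread calls countDown()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "repair-session-sketch");
        session.start();
        try {
            // Poll until the thread has actually parked on the latch.
            Thread.State observed = session.getState();
            while (observed != Thread.State.WAITING) {
                Thread.sleep(10);
                observed = session.getState();
            }
            done.countDown(); // the "signal": only now can the thread finish
            session.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stateWhileBlocked());
    }
}
```

If that reading is right, the WAITING state alone does not say whether the
repair is stuck or just slow; it only says the signal has not arrived yet.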
CPU sampling for a few minutes shows these methods as hot spots (mostly the
top two):
org.apache.cassandra.db.ColumnFamilyStore.isKeyInRemainingSSTables( )
org.apache.cassandra.utils.BloomFilter.getHashBuckets( )
org.apache.cassandra.io.sstable.SSTableIdentityIterator.echoData()

netstats does not show anything streaming to/from any of the nodes.

-Adi Pandit
