We have a 4-node Cassandra 0.7.6 cluster with RF=2 and 3 TB of data per node. A read repair was kicked off on node 4 last week and is still in progress. I then kicked off a read repair on node 2 a few days ago. We were using the cluster (reads/writes/updates, NO deletes) while the repairs were in progress, but no data has been written for the past 3-4 days. I was hoping the repairs would finish in that time frame before we proceed with further writes/deletes.
Would it be safe to stop the repair and kick it off per column family, or to do a full scan of all keys as suggested in an earlier discussion? Are there any other suggestions for hastening this repair?

On both nodes the repair thread has been waiting at this stage for a long time (~60+ hours):

    java.lang.Thread.State: WAITING
        at java.lang.Object.wait(Native Method)
        - waiting on <580857f3> (a org.apache.cassandra.utils.SimpleCondition)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:38)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:791)

    Locked ownable synchronizers:
        - None

CPU sampling for a few minutes shows these methods as hot spots (mostly the top two):

    org.apache.cassandra.db.ColumnFamilyStore.isKeyInRemainingSSTables( )
    org.apache.cassandra.utils.BloomFilter.getHashBuckets( )
    org.apache.cassandra.io.sstable.SSTableIdentityIterator.echoData()

netstats does not show anything streaming to/from any of the nodes.

-Adi Pandit
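
For reference, the WAITING state in the dump above is what the JVM reports for any thread parked in Object.wait() on a condition that has not been signalled yet. On its own it does not show whether the work being waited on is progressing or stuck. Below is a minimal Java sketch of that behaviour. OneShotCondition and WaitingStateDemo are hypothetical names made up for illustration; this is not Cassandra's SimpleCondition source.

```java
public class WaitingStateDemo {

    // Hypothetical one-shot condition, for illustration only
    // (not the actual org.apache.cassandra.utils.SimpleCondition).
    static final class OneShotCondition {
        private boolean set = false;

        // Blocks the caller in Object.wait() until signal() is called.
        synchronized void await() throws InterruptedException {
            while (!set) {
                wait();
            }
        }

        // Releases every thread blocked in await().
        synchronized void signal() {
            set = true;
            notifyAll();
        }
    }

    // Starts a thread that awaits the condition, samples its state while it
    // is blocked, then signals the condition so the thread can finish.
    static Thread.State stateWhileBlocked(OneShotCondition condition) {
        Thread waiter = new Thread(() -> {
            try {
                condition.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "repair-session-demo");
        waiter.setDaemon(true);
        waiter.start();
        try {
            // Poll (up to 5 s) until the waiter has actually parked in wait().
            long deadline = System.currentTimeMillis() + 5000;
            while (waiter.getState() != Thread.State.WAITING
                    && System.currentTimeMillis() < deadline) {
                Thread.sleep(10);
            }
            Thread.State observed = waiter.getState();
            condition.signal();
            waiter.join(5000);
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sampling", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stateWhileBlocked(new OneShotCondition()));
    }
}
```

In this sketch the waiter stays in WAITING until something calls signal(), which is why the thread dump has to be read together with other evidence of progress such as CPU sampling or netstats.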