Hi,
We have a 10-node Cassandra 1.0.3 cluster spread across several DCs. The
weekly nodetool repair has been stuck for an unusually long time on node
10.254.237.2. Log output on this node:
INFO 11:19:42,045 Starting repair command #1, repairing 5 ranges.
INFO 11:19:42,053 [repair #040aae00-28a1-11e1-0000-e378018944ff] new
session: will sync localhost/10.254.237.2, /10.254.221.2, /10.253.2.2,
/10.254.217.2, /10.254.94.2 on range
(85070591730234615865843651857942052864,85070591730234615865843651857942052865]
for meter.[eventschema, schema, ids, transaction]
INFO 11:19:42,055 [repair #040aae00-28a1-11e1-0000-e378018944ff]
requests for merkle tree sent for eventschema (to [/10.253.2.2,
/10.254.221.2, localhost/10.254.237.2, /10.254.217.2, /10.254.94.2])
INFO 11:19:42,063 Enqueuing flush of
Memtable-eventschema@1509399856(18748/23435 serialized/live bytes, 4 ops)
INFO 11:19:42,063 Writing Memtable-eventschema@1509399856(18748/23435
serialized/live bytes, 4 ops)
INFO 11:19:42,072 Completed flushing
/spool1/cassandra/data/meter/eventschema-hb-40-Data.db (4745 bytes)
INFO 11:19:42,073 Discarding obsolete commit
log:CommitLogSegment(/var/lib/cassandra/commitlog/CommitLog-1324019623060.log)
INFO 11:19:42,076 [repair #040aae00-28a1-11e1-0000-e378018944ff]
Received merkle tree for eventschema from localhost/10.254.237.2
INFO 11:19:42,102 [repair #040aae00-28a1-11e1-0000-e378018944ff]
Received merkle tree for eventschema from /10.254.221.2
INFO 11:19:42,128 [repair #040aae00-28a1-11e1-0000-e378018944ff]
Received merkle tree for eventschema from /10.254.217.2
INFO 11:19:42,228 [repair #040aae00-28a1-11e1-0000-e378018944ff]
Received merkle tree for eventschema from /10.253.2.2
After that, nothing more is logged for a long time. So the node sent merkle
tree requests to the other nodes and received all of them except the one
from 10.254.94.2.
On the 10.254.94.2 node:
INFO 11:19:42,083 [repair #040aae00-28a1-11e1-0000-e378018944ff] Sending
completed merkle tree to /10.254.237.2 for (meter,eventschema)
So the merkle tree was lost somewhere. Will this wait eventually end on its
own, or do I need to restart the node?