To be clear, this happened on a 1.1.2 node, and it happened again *after* you had run a scrub?
Has this cluster been around for a while, or was the data created with 1.1? Can you confirm that all sstables were re-written for the CF? Check the timestamp on the files. Also, all the files should have the same version, the -h?- part of the name.

Can you repair the other CFs?

If this cannot be repaired by scrub or upgradesstables, you may need to cut the row out of the sstables using sstable2json and json2sstable (a rough command sketch is below the quoted message).

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 8/07/2012, at 4:05 PM, Michael Theroux wrote:

> Hello,
>
> We're in the process of trying to move a 6-node cluster from RF=1 to RF=3.
> Once our replication factor was upped to 3, we ran nodetool repair, and
> immediately hit an issue on the first node we ran repair on:
>
> INFO 03:08:51,536 Starting repair command #1, repairing 2 ranges.
> INFO 03:08:51,552 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] new session: will sync xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101, /10.29.187.61 on range (Token(bytes[d5555555555555555555555555555558]),Token(bytes[00000000000000000000000000000000])] for xxxxx.[aaaaa, bbbbb, ccccc, ddddd, eeeee, fffff, ggggg, hhhhh, iiiii, jjjjj, kkkkk, lllll, mmmmm, nnnnn, ooooo, ppppp, qqqqq, rrrrr, sssss]
> INFO 03:08:51,555 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] requesting merkle trees for aaaaa (to [/10.29.187.61, xxx-xx-xx-xxx-compute-1.amazonaws.com/10.202.99.101])
> INFO 03:08:52,719 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Received merkle tree for aaaaa from /10.29.187.61
> INFO 03:08:53,518 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Received merkle tree for aaaaa from xxx-xx-xx-xxx-.compute-1.amazonaws.com/10.202.99.101
> INFO 03:08:53,519 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] requesting merkle trees for bbbbb (to [/10.29.187.61, xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101])
> INFO 03:08:53,639 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Endpoints /10.29.187.61 and xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101 are consistent for aaaaa
> INFO 03:08:53,640 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] aaaaa is fully synced (18 remaining column family to sync for this session)
> INFO 03:08:54,049 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Received merkle tree for bbbbb from /10.29.187.61
> ERROR 03:09:09,440 Exception in thread Thread[ValidationExecutor:1,1,main]
> java.lang.AssertionError: row DecoratedKey(Token(bytes[efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47]), efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47) received out of order wrt DecoratedKey(Token(bytes[f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb]), f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb)
>         at org.apache.cassandra.service.AntiEntropyService$Validator.add(AntiEntropyService.java:349)
>         at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:712)
>         at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:68)
>         at org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:438)
>         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>         at java.util.concurrent.FutureTask.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
>
> It looks, from the log above, like the sync of the "aaaaa" column family was
> successful. However, the "bbbbb" column family resulted in this error. In
> addition, the repair hung after this error. We ran nodetool scrub on all
> nodes, invalidated the key and row caches, and tried again (with RF=2), and
> it didn't help alleviate the problem.
>
> Some other important pieces of information:
> We use ByteOrderedPartitioner (we MD5 hash the keys ourselves)
> We're using Leveled Compaction
> As we're in the middle of a transition, one node is on 1.1.2 (the one we
> tried repair on), the other 5 are on 1.1.1
>
> Thanks,
> -Mike
>
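For reference, a rough sketch of the file-version check and the row-excision step suggested above, assuming the default data directory layout and the 1.1-era tool syntax. The paths, keyspace/CF names, and generation numbers below are placeholders, so verify the exact sstable2json/json2sstable options on your build before running anything:

# 1. Confirm every sstable for the CF was re-written: same on-disk version
#    (the -h?- token in the file name, e.g. -hd-) and a post-scrub timestamp.
ls -l /var/lib/cassandra/data/<keyspace>/<cf>/        # placeholder path
#    e.g. <keyspace>-<cf>-hd-123-Data.db  ->  version "hd"

# 2. Re-write any stragglers to the current format:
nodetool -h localhost scrub <keyspace> <cf>
nodetool -h localhost upgradesstables <keyspace> <cf>

# 3. If validation still fails on the out-of-order row, cut that row out of
#    the offending sstable with the node stopped. File names are placeholders;
#    the key is given as hex, as it appears in the log (check the key format
#    expected by your partitioner):
sstable2json <keyspace>-<cf>-hd-123-Data.db \
    -x efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47 \
    > cf-minus-bad-row.json
json2sstable -K <keyspace> -c <cf> cf-minus-bad-row.json <keyspace>-<cf>-hd-124-Data.db
# Then move the original sstable and its companion files (-Index.db, -Filter.db,
# -Statistics.db, ...) out of the data directory, drop the rebuilt one in with an
# unused generation number, and restart the node.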