To be clear, this happened on a 1.1.2 node, and it happened again *after* you 
had run a scrub?

Has this cluster been around for a while, or was the data created with 1.1?

Can you confirm that all sstables were re-written for the CF? Check the 
timestamps on the files. Also, all files should have the same version, the -h?- 
part of the file name.
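
For example, something like this (assuming the default data directory; adjust 
for your data_file_directories setting and substitute your keyspace name):

    # List the sstables for the keyspace; check the modification times and
    # the version component of the file names (e.g. "-hd-").
    ls -lR /var/lib/cassandra/data/YourKeyspace/ | grep Data.db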

Can you repair the other CFs? 
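
In 1.1 you can limit repair to a single keyspace and column family, e.g. 
(keyspace and CF names below are placeholders):

    nodetool -h localhost repair YourKeyspace YourCF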

If this cannot be repaired by scrub or upgradesstables you may need to cut the 
row out of the sstables using sstable2json and json2sstable. 
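
As a rough sketch only (the keyspace, CF, file names and row key below are 
placeholders; do this with the node stopped and check the tools' usage output 
on your version first):

    # Dump the sstable to JSON, excluding the bad row key (hex).
    sstable2json /path/to/YourKeyspace-YourCF-hd-123-Data.db \
        -x <bad-row-key-in-hex> > YourCF-dump.json

    # Build a replacement sstable from the JSON.
    json2sstable -K YourKeyspace -c YourCF YourCF-dump.json \
        /path/to/YourKeyspace-YourCF-hd-124-Data.db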

 
Cheers
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 8/07/2012, at 4:05 PM, Michael Theroux wrote:

> Hello,
> 
> We're in the process of trying to move a 6-node cluster from RF=1 to RF=3. 
> Once our replication factor was upped to 3, we ran nodetool repair, and 
> immediately hit an issue on the first node we ran repair on:
> 
>  INFO 03:08:51,536 Starting repair command #1, repairing 2 ranges.
>  INFO 03:08:51,552 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] new 
> session: will sync xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101, 
> /10.29.187.61 on range 
> (Token(bytes[d5555555555555555555555555555558]),Token(bytes[00000000000000000000000000000000])]
>  for xxxxx.[aaaaa, bbbbb, ccccc, ddddd, eeeee, fffff, ggggg, hhhhh, iiiii, 
> jjjjj, kkkkk, lllll, mmmmm, nnnnn, ooooo, ppppp, qqqqq, rrrrr, sssss]
>  INFO 03:08:51,555 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] requesting 
> merkle trees for aaaaa (to [/10.29.187.61, 
> xxx-xx-xx-xxx-compute-1.amazonaws.com/10.202.99.101])
>  INFO 03:08:52,719 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Received 
> merkle tree for aaaaa from /10.29.187.61
>  INFO 03:08:53,518 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Received 
> merkle tree for aaaaa from 
> xxx-xx-xx-xxx-.compute-1.amazonaws.com/10.202.99.101
>  INFO 03:08:53,519 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] requesting 
> merkle trees for bbbbb (to [/10.29.187.61, 
> xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101])
>  INFO 03:08:53,639 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Endpoints 
> /10.29.187.61 and xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101 are 
> consistent for aaaaa
>  INFO 03:08:53,640 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] aaaaa is 
> fully synced (18 remaining column family to sync for this session)
>  INFO 03:08:54,049 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Received 
> merkle tree for bbbbb from /10.29.187.61
> ERROR 03:09:09,440 Exception in thread Thread[ValidationExecutor:1,1,main]
> java.lang.AssertionError: row 
> DecoratedKey(Token(bytes[efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47]),
>  efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47) received 
> out of order wrt 
> DecoratedKey(Token(bytes[f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb]),
>  f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb)
>       at 
> org.apache.cassandra.service.AntiEntropyService$Validator.add(AntiEntropyService.java:349)
>       at 
> org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:712)
>       at 
> org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:68)
>       at 
> org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:438)
>       at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>       at java.util.concurrent.FutureTask.run(Unknown Source)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown 
> Source)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>       at java.lang.Thread.run(Unknown Source)
> 
> It looks from the log above that the sync of the "aaaaa" column family was 
> successful.  However, the "bbbbb" column family resulted in this error.  In 
> addition, the repair hung after this error.  We ran nodetool scrub on all 
> nodes and invalidated the key and row caches and tried again (with RF=2), and 
> it didn't alleviate the problem.
> 
> Some other important pieces of information:
> We use ByteOrderedPartitioner (we MD5 hash the keys ourselves)
> We're using Leveled Compaction
> As we're in the middle of a transition, one node is on 1.1.2 (the one we 
> tried repair on) and the other five are on 1.1.1
> 
> Thanks,
> -Mike
> 
