On Wed, Mar 13, 2013 at 12:39 PM, Wei Zhu <wz1...@yahoo.com> wrote:
> My guess would be there is some exception during the repair and your session
> is aborted.
> Here is the code of doing repair:
>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/AntiEntropyService.java
>
> looking for
>
> logger.info
>
> Compare that with your log file, it should give you a rough idea in which
> stage repaired died.

Thanks for the link to the source. That's a little hard to grok, but your suggestion to examine the logs more thoroughly was helpful. I was able to determine that repair hung due to connection errors during streaming. I'll include log snippets below, but this leads me to other more important questions...

1. is this a nodetool bug? is there any way to propagate the java.io.IOException back to nodetool?
2. network problems on EC2, I'm shocked! are there recommended network settings for EC2?

Dane

Here are the relevant logs showing (A) repair progress, and (B) java.io.IOExceptions

(A) repair progress

INFO [Thread-5314] 2013-03-11 23:29:28,866 StorageService.java (line 2364) Starting repair command #9, repairing 1 ranges for keyspace OpsCenter
INFO [AntiEntropySessions:13] 2013-03-11 23:29:28,867 AntiEntropyService.java (line 652) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] new session: will sync /10.34.37.195, /10.82.233.59 on range (0,28356863910078205288614550619314017621] for OpsCenter.[events, rollups60, settings, pdps, rollups86400, events_timeline, rollups300, rollups7200]
INFO [Thread-5320] 2013-03-11 23:29:29,198 AntiEntropyService.java (line 765) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] events is fully synced (7 remaining column family to sync for this session)
INFO [AntiEntropyStage:1] 2013-03-11 23:38:02,198 AntiEntropyService.java (line 765) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] settings is fully synced (6 remaining column family to sync for this session)
INFO [AntiEntropyStage:1] 2013-03-11 23:38:02,617 AntiEntropyService.java (line 765) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] pdps is fully synced (5 remaining column family to sync for this session)
INFO [Streaming to /10.82.233.59:34] 2013-03-11 23:38:12,491 AntiEntropyService.java (line 765) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] rollups86400 is fully synced (4 remaining column family to sync for this session)
INFO [Streaming to /10.82.233.59:36] 2013-03-11 23:39:55,886 AntiEntropyService.java (line 765) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] rollups7200 is fully synced (3 remaining column family to sync for this session)

(B) java.io.IOException

# grep -A1 ERROR /var/log/cassandra/system.log.2
ERROR [Streaming to /10.82.233.59:34] 2013-03-11 23:38:12,654 CassandraDaemon.java (line 132) Exception in thread Thread[Streaming to /10.82.233.59:34,5,main]
java.lang.RuntimeException: java.io.IOException: Connection reset by peer
--
ERROR [Streaming to /10.82.233.59:35] 2013-03-11 23:38:12,692 CassandraDaemon.java (line 132) Exception in thread Thread[Streaming to /10.82.233.59:35,5,main]
java.lang.RuntimeException: java.io.IOException: Broken pipe
--
ERROR [Streaming to /10.82.233.59:36] 2013-03-11 23:39:55,932 CassandraDaemon.java (line 132) Exception in thread Thread[Streaming to /10.82.233.59:36,5,main]
java.lang.RuntimeException: java.io.IOException: Broken pipe
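
Since nodetool did not report the failure, the comparison of repair progress against streaming errors had to be done by hand. A rough sketch of how that could be scripted follows. It is not part of Cassandra or nodetool: the function name summarize_repair and the three regular expressions are made up for illustration, and the patterns assume the message wording shown in the log lines above, which may differ in other Cassandra versions.

```python
import re

# Patterns modelled on the log lines quoted in this message.
# They are assumptions about the log format, not a documented interface.
NEW_SESSION = re.compile(
    r"\[repair #(?P<id>[0-9a-f-]+)\] new session: .* "
    r"for (?P<ks>\w+)\.\[(?P<cfs>[^\]]*)\]"
)
SYNCED = re.compile(
    r"\[repair #(?P<id>[0-9a-f-]+)\] (?P<cf>\w+) is fully synced"
)
STREAM_ERROR = re.compile(
    r"^ERROR \[(?P<thread>Streaming to [^\]]+)\] (?P<ts>\S+ \S+)"
)


def summarize_repair(lines):
    """Return (sessions, errors) extracted from an iterable of log lines.

    sessions maps a repair session id to its keyspace, the column
    families it expected to sync, those reported as synced, and those
    still pending. errors is a list of (timestamp, thread name) pairs
    for ERROR lines logged by streaming threads.
    """
    sessions = {}
    errors = []
    for raw in lines:
        line = raw.strip()

        m = NEW_SESSION.search(line)
        if m:
            cfs = [name.strip() for name in m.group("cfs").split(",")]
            sessions[m.group("id")] = {
                "keyspace": m.group("ks"),
                "expected": cfs,
                "synced": [],
            }
            continue

        m = SYNCED.search(line)
        if m and m.group("id") in sessions:
            sessions[m.group("id")]["synced"].append(m.group("cf"))
            continue

        m = STREAM_ERROR.match(line)
        if m:
            errors.append((m.group("ts"), m.group("thread")))

    # Anything expected but never reported as synced is still pending.
    for session in sessions.values():
        session["pending"] = [
            cf for cf in session["expected"] if cf not in session["synced"]
        ]
    return sessions, errors
```

To try it, open the log file (for example /var/log/cassandra/system.log.2) and pass the file object to summarize_repair. For the repair session quoted above, the pending list should come out as rollups60, events_timeline and rollups300, which agrees with the "3 remaining column family" count in the last INFO line. A session with pending column families plus streaming errors at matching timestamps is a reasonable hint, though not proof, that the repair stalled on a failed stream.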