Re: nodetool repair stalled

Paolo Crosato Tue, 14 Jan 2014 01:04:48 -0800

I was able to complete the repair, repairing one keyspace and cf each time.
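For anyone wanting to script the same workaround, here is a minimal sketch of repairing one keyspace and column family per invocation. It assumes nodetool is on the PATH and that JMX listens on port 7199 (the port shown in the process listing in this thread). The helper names `build_repair_commands` and `run_repairs` are hypothetical, and the keyspace/column family pairs are just the ones mentioned in this thread.

```python
import subprocess


def build_repair_commands(tables, jmx_port=7199):
    """Return one `nodetool repair` argv per (keyspace, column family) pair."""
    return [
        ["nodetool", "-p", str(jmx_port), "repair", keyspace, cf]
        for keyspace, cf in tables
    ]


def run_repairs(tables, jmx_port=7199, runner=subprocess.run):
    """Run the repairs one after another, stopping at the first failure."""
    for argv in build_repair_commands(tables, jmx_port):
        # check=True raises CalledProcessError if nodetool exits non-zero.
        runner(argv, check=True)


# Print the commands instead of running them, so they can be reviewed first.
for argv in build_repair_commands(
    [("tiergast", "positions"), ("OpsCenter", "rollups60")]
):
    print(" ".join(argv))
```

Calling `run_repairs` with the same list would execute the commands sequentially; passing a different `runner` makes it easy to dry-run or log them.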

However, the last session is still shown as an active process even though it completed successfully. This is the log:

INFO [CompactionExecutor:252] 2014-01-14 03:10:13,105 CompactionTask.java (line 275) Compacted 12 sstables to [/data/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-jb-9492,]. 1,371 bytes to 42 (~3% of original) in 56ms = 0.000715MB/s. 13 total partitions merged to 1. Partition merge counts were {1:1, 2:6, }
INFO [STREAM-IN-/10.255.235.19] 2014-01-14 03:11:40,750 StreamResultFuture.java (line 181) [Stream #6cf54d20-7cbf-11e3-a6c2-a1357a0d9222] Session with /10.255.235.19 is complete
INFO [STREAM-IN-/10.255.235.19] 2014-01-14 03:11:40,750 StreamResultFuture.java (line 215) [Stream #6cf54d20-7cbf-11e3-a6c2-a1357a0d9222] All sessions completed
INFO [STREAM-IN-/10.255.235.19] 2014-01-14 03:11:40,751 StreamingRepairTask.java (line 96) [repair #02f3f620-7cbe-11e3-a6c2-a1357a0d9222] streaming task succeed, returning response to /10.255.235.18
INFO [AntiEntropyStage:1] 2014-01-14 03:11:40,751 RepairSession.java (line 214) [repair #02f3f620-7cbe-11e3-a6c2-a1357a0d9222] positions is fully synced
INFO [AntiEntropySessions:161] 2014-01-14 03:11:40,751 RepairSession.java (line 274) [repair #02f3f620-7cbe-11e3-a6c2-a1357a0d9222] session completed successfully
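To check which repair sessions actually finished, the log can be scanned for the "session completed" messages. The following is a minimal sketch, assuming the message wording is exactly what appears in the excerpts in this thread; the function name `repair_outcomes` is hypothetical, and the pattern tolerates the missing spaces that line wrapping introduced in pasted logs.

```python
import re

# Matches "[repair #<id>] session completed ..." and tolerates missing
# spaces, as seen in line-wrapped log excerpts.
SESSION_DONE = re.compile(
    r"\[repair\s*#(?P<id>[0-9a-f-]+)\]\s*session completed\s*"
    r"(?P<outcome>successfully|with the\s*following error)"
)


def repair_outcomes(log_text):
    """Map repair session id -> "ok" or "failed" for sessions that finished."""
    outcomes = {}
    for match in SESSION_DONE.finditer(log_text):
        ok = match.group("outcome") == "successfully"
        outcomes[match.group("id")] = "ok" if ok else "failed"
    return outcomes
```

A session id that was started but never shows up in the returned mapping is a candidate for a hung session.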


This is what ps -eaf |grep java shows:

500 25488 25459 0 Jan13 ? 00:00:43 /usr/bin/java -cp /etc/cassandra/conf:/usr/share/java/jna.jar:/usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/apache-cassandra-2.0.3.jar:/usr/share/cassandra/lib/apache-cassandra-clientutil-2.0.3.jar:/usr/share/cassandra/lib/apache-cassandra-thrift-2.0.3.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/guava-15.0.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.2.5.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jline-1.0.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.9.1.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/lz4-1.2.0.jar:/usr/share/cassandra/lib/metrics-core-2.2.0.jar:/usr/share/cassandra/lib/netty-3.6.6.Final.jar:/usr/share/cassandra/lib/reporter-config-2.1.0.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.7.2.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/share/cassandra/lib/snakeyaml-1.11.jar:/usr/share/cassandra/lib/snappy-java-1.0.5.jar:/usr/share/cassandra/lib/snaptree-0.1.jar:/usr/share/cassandra/lib/stress.jar:/usr/share/cassandra/lib/thrift-server-0.3.2.jar -Xmx32m -Dlog4j.configuration=log4j-tools.properties -Dstorage-config=/etc/cassandra/conf org.apache.cassandra.tools.NodeCmd -p 7199 repair tiergast positions
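Spotting such leftover nodetool processes can be automated by filtering the ps output. This is a minimal sketch, assuming the `ps -eaf` column layout shown above (UID, PID, PPID, C, STIME, TTY, TIME, CMD) and the `org.apache.cassandra.tools.NodeCmd` class name that nodetool uses in 2.0.x; the function name `lingering_repair_pids` is hypothetical.

```python
def lingering_repair_pids(ps_output):
    """Return PIDs of NodeCmd (nodetool) processes whose arguments include 'repair'."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # `ps -eaf` columns: UID PID PPID C STIME TTY TIME CMD...
        # The header row is skipped because its second field is not numeric.
        if len(fields) < 8 or not fields[1].isdigit():
            continue
        command_args = fields[7:]
        is_nodetool = "org.apache.cassandra.tools.NodeCmd" in command_args
        if is_nodetool and "repair" in command_args:
            pids.append(int(fields[1]))
    return pids
```

The text to pass in can be captured with, for example, `subprocess.run(["ps", "-eaf"], capture_output=True, text=True).stdout`.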


Is this a known bug?

Regards,

Paolo Crosato

On 13/01/2014 10:25, Paolo Crosato wrote:

Hi,
I rebooted the nodes and started a fresh repair session. The repair session was started on node 1.
This time I actually got this error on the node that started the repair:
ERROR [AntiEntropySessions:2] 2014-01-10 09:44:46,360 RepairSession.java (line 278) [repair #728f4860-79d3-11e3-8c98-a1357a0d9222] session completed with the following error
org.apache.cassandra.exceptions.RepairException: [repair #728f4860-79d3-11e3-8c98-a1357a0d9222 on OpsCenter/rollups300, (4515884230644880127,4556138740897423021]] Sync failed between /10.255.235.18 and /10.255.235.19
    at org.apache.cassandra.repair.RepairSession.syncComplete(RepairSession.java:200)
    at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:193)
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
ERROR [AntiEntropySessions:2] 2014-01-10 09:44:46,399 CassandraDaemon.java (line 187) Exception in thread Thread[AntiEntropySessions:2,5,RMI Runtime]
java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #728f4860-79d3-11e3-8c98-a1357a0d9222 on OpsCenter/rollups300, (4515884230644880127,4556138740897423021]] Sync failed between /10.255.235.18 and /10.255.235.19
    at com.google.common.base.Throwables.propagate(Throwables.java:160)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #728f4860-79d3-11e3-8c98-a1357a0d9222 on OpsCenter/rollups300, (4515884230644880127,4556138740897423021]] Sync failed between /10.255.235.18 and /10.255.235.19
    at org.apache.cassandra.repair.RepairSession.syncComplete(RepairSession.java:200)
    at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:193)
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
    ... 3 more
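To see which keyspace, column family and node pair a failed sync refers to, the RepairException message can be picked apart with a regular expression. This is a minimal sketch based only on the message format visible in the error above; the name `sync_failures` is hypothetical, and the pattern allows for spaces lost to line wrapping.

```python
import re

# Pattern derived from the "Sync failed between" message quoted above.
SYNC_FAILED = re.compile(
    r"\[repair\s*#(?P<session>[0-9a-f-]+) on (?P<keyspace>\w+)/(?P<cf>\w+),\s*"
    r"\((?P<start>-?\d+),(?P<end>-?\d+)\]\]\s*Sync failed between\s*"
    r"/(?P<node_a>[\d.]+) and\s*/(?P<node_b>[\d.]+)"
)


def sync_failures(log_text):
    """Return a set of (keyspace, cf, node_a, node_b) tuples for failed syncs."""
    return {
        (m.group("keyspace"), m.group("cf"), m.group("node_a"), m.group("node_b"))
        for m in SYNC_FAILED.finditer(log_text)
    }
```

Because the same exception is logged several times per failure, returning a set collapses the repeats into one entry per affected column family and node pair.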

On the other node, I left some blank lines between these timestamps:
INFO [ValidationExecutor:3] 2014-01-10 09:42:41,320 SSTableReader.java (line 223) Opening /data/cassandra/data/OpsCenter/rollups60/snapshots/29e4d5d0-79d3-11e3-8c98-a1357a0d9222/OpsCenter-rollups60-jb-11522 (88 bytes)
INFO [ValidationExecutor:14] 2014-01-10 10:37:48,509 SSTableReader.java (line 223) Opening /data/cassandra/data/OpsCenter/rollups60/snapshots/d5176b00-79da-11e3-8c98-a1357a0d9222/OpsCenter-rollups60-jb-16275 (493003 b
In between, I have many log files full of "Opening ..." entries.
I've noticed that the repair sessions always seem to hang on the OpsCenter keyspace. Would uninstalling and reinstalling help resolve the issue?
Anyway, I attached the logs for the nodes involved. I'm sorry if there is a lot of noise.
Thanks for any input.

Regards,

Paolo Crosato

On 09/01/2014 03:54, sankalp kohli wrote:
Hi,
Can you attach the logs from around the time of the repair? Please do that for the node which triggered it and for the nodes involved in the repair. I will try to find something useful.
Thanks,
Sankalp
On Wed, Jan 8, 2014 at 10:18 AM, Robert Coli <rc...@eventbrite.com> wrote:
    On Wed, Jan 8, 2014 at 8:52 AM, Paolo Crosato
    <paolo.cros...@targaubiest.com> wrote:

        I have two nodes with Cassandra 2.0.3, where repair sessions
        hang for an indefinite time. I'm running nodetool repair once
        a week on every node, on different days. Currently I have
        about 4 repair sessions running on each node, one of them for
        3 weeks, and none has finished.
        Reading the logs I didn't find any exception; apparently one
        of the repair sessions got stuck at this command:

        Has anybody any suggestion on why a nodetool repair might be
        stuck and how to debug it?


    Cassandra repair has never quite worked right. It got a wholesale
    re-write in 2.0.x and "should" be more robust and at the very least
    log more than before. But unfortunately I have heard a few
    reports like yours, so it is probably not completely fixed.

    That said, the only option you have for failed repairs seems to
    be to restart the affected nodes. Your input as an operator of
    2.0.x who would appreciate an alternative is welcome at:

    https://issues.apache.org/jira/browse/CASSANDRA-3486

    =Rob
