This is a follow-up to my previous post “Cassandra taking snapshots
automatically?”. I’ve renamed the thread to better describe the new information
I’ve discovered.
I have a four node, RF=3, 2.0.11 cluster that was producing snapshots at a
prodigious rate. I let the cluster sit idle overnight to settle down, and
deleted all the snapshots. I waited for a while to make sure it really was done
creating snapshots. I then ran “nodetool repair test2_browse” on one node and
immediately got snapshots on three of my four nodes. Here’s what my
/var/lib/cassandra/data/test2_browse/path_by_parent/snapshots directory looks
like after a few minutes:
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33adb6b0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33aea110-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33af6460-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33b027b0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33b0c3f0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33b1ae50-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33b24a90-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2d1300-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2daf40-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2e4b80-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2ee7c0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2f5cf0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2ff930-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 40bbb190-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 40bc74e0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 40bd1120-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 40bdd470-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 40be70b0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 474b3a80-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 474c24e0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 474d3650-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 4dd9d910-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 4ddac370-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 546877a0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 54696200-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 546a7370-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 546b36c0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 5af73d40-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 60ee7dd0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 60ef4120-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 60efdd60-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 60f0a0b0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 60f16400-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 677d1c60-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 677e06c0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 6e0bbaf0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 749a5980-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 7b28f810-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 81b796a0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:12 87ae3af0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:12 87aed730-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:12 87af7370-7bbf-11e4-893d-d96c3e745723
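
The snapshot names in the listing have the shape of version 1 (time-based) UUIDs, so each name should carry its own creation timestamp, which can be compared with the directory mtimes to see how fast the snapshots are being produced. Here is a minimal Python sketch, assuming the names really are version 1 UUIDs; the helper name uuid1_to_datetime is my own and not part of any Cassandra tool:

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100 ns intervals since 1582-10-15 UTC.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_to_datetime(name):
    """Return the UTC time embedded in a version 1 UUID string.

    Hypothetical helper for illustration; raises ValueError when the
    name is not a version 1 UUID.
    """
    u = uuid.UUID(name)
    if u.version != 1:
        raise ValueError("not a version 1 UUID: %s" % name)
    # u.time is in 100 ns units; integer-divide down to microseconds.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


def creation_times(names):
    """Map each snapshot directory name to its embedded UTC timestamp."""
    return {name: uuid1_to_datetime(name) for name in names}
```

Feeding it the first name in the listing, 33adb6b0-7bbf-11e4-893d-d96c3e745723, should give a time at 09 minutes past the hour in UTC, which would line up with the 07:09 mtime if the hosts run at a whole-hour offset from UTC.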
I also get lots of events like these in system.log:
ERROR [AntiEntropySessions:1] 2014-12-03 13:35:40,541 CassandraDaemon.java (line 199) Exception in thread Thread[AntiEntropySessions:1,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
    at com.google.common.base.Throwables.propagate(Throwables.java:160)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Failed during snapshot creation.
    at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:323)
    at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:126)
    at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)
    ... 3 more
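
To get a feel for how often these errors fire, the header lines can be tallied per hour. This is a rough Python sketch, assuming that in the actual system.log each event starts with a single header line of the form "ERROR [AntiEntropySessions:N] yyyy-mm-dd hh:mm:ss,SSS ...", as in the excerpt above; the function name and the regular expression are my own and match only what is shown in this post:

```python
import re
from collections import Counter

# Matches only the header shape seen in this post's log excerpt.
HEADER = re.compile(
    r"^ERROR \[AntiEntropySessions:\d+\] "
    r"(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2},\d{3}"
)


def count_repair_errors(lines):
    """Count AntiEntropySessions ERROR headers per (date, hour).

    `lines` is any iterable of log lines, e.g. an open file object.
    Stack-trace lines and unrelated entries are ignored.
    """
    counts = Counter()
    for line in lines:
        match = HEADER.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Because the function only iterates over lines, it can be handed an open system.log directly; comparing the counts across the four nodes might show whether the failures track the three nodes that produce snapshots.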
Does anybody have any idea what might cause this? That it happens at all is
bizarre, and that it happens on only three nodes is even more bizarre. Also,
the cluster really doesn’t seem to have any difficulty creating snapshots, so
the “Failed during snapshot creation” errors are quite a mystery.
And while we’re talking repairs, I have some questions about monitoring them.
Even when not running an explicit repair, I randomly see repair tasks in
OpsCenter. They usually only last a few seconds, and the progress percentage
often goes into the quadruple digits. When I run repair using nodetool, it
takes several hours, but again, all I ever see in OpsCenter are these random,
short-lived repair tasks. Is there any way to monitor repairs? I frequently see
posts about stalled repairs. How do you know a repair has stalled when you
can’t see it? And how do you know whether a repair actually succeeded?
Thanks in advance
Robert