This is a follow-up to my previous post “Cassandra taking snapshots automatically?”. I’ve renamed the thread to better describe the new information I’ve discovered.
I have a four-node, RF=3, 2.0.11 cluster that was producing snapshots at a prodigious rate. I let the cluster sit idle overnight to settle down, and deleted all the snapshots. I waited for a while to make sure it really was done creating snapshots. I then ran "nodetool repair test2_browse" on one node and immediately got snapshots on three of my four nodes. Here’s what my /var/lib/cassandra/data/test2_browse/path_by_parent/snapshots directory looks like after a few minutes:

drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33adb6b0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33aea110-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33af6460-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33b027b0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33b0c3f0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33b1ae50-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 33b24a90-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2d1300-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2daf40-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2e4b80-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2ee7c0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2f5cf0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:09 3a2ff930-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 40bbb190-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 40bc74e0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 40bd1120-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 40bdd470-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 40be70b0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 474b3a80-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 474c24e0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 474d3650-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 4dd9d910-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 4ddac370-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 546877a0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 54696200-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 546a7370-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 546b36c0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:10 5af73d40-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 60ee7dd0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 60ef4120-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 60efdd60-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 60f0a0b0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 60f16400-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 677d1c60-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 677e06c0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 6e0bbaf0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 749a5980-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 7b28f810-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:11 81b796a0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:12 87ae3af0-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:12 87aed730-7bbf-11e4-893d-d96c3e745723
drwxr-xr-x 2 cassandra cassandra 40960 Dec 4 07:12 87af7370-7bbf-11e4-893d-d96c3e745723

I also get lots of events like these in system.log:

ERROR [AntiEntropySessions:1] 2014-12-03 13:35:40,541 CassandraDaemon.java (line 199) Exception in thread Thread[AntiEntropySessions:1,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
        at com.google.common.base.Throwables.propagate(Throwables.java:160)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Failed during snapshot creation.
        at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:323)
        at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:126)
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)
        ... 3 more

Does anybody have any idea what might cause this? That it happens at all is bizarre, and that it happens on only three nodes is even more bizarre. Also, the nodes clearly have no difficulty creating snapshots, so the "Failed during snapshot creation" errors are quite a mystery.

And while we’re talking repairs, I have some questions about monitoring them. Even when not running an explicit repair, I randomly see repair tasks in OpsCenter. They usually only last a few seconds, and the progress percentage often goes into the quadruple digits.
When I run repair using nodetool, it takes several hours, but again, all I ever see in OpsCenter are these random, short-lived repair tasks. Is there any way to monitor repairs? I frequently see posts about stalled repairs. How do you know a repair has stalled when you can’t see it? And how do you know if a repair actually succeeded or not?

Thanks in advance,

Robert
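
P.S. The snapshot directory names in the listing above have the form of version 1 (time-based) UUIDs, so each name should carry its own creation timestamp, which may help when correlating snapshots with repair activity in system.log. Whether these names are repair session IDs is an assumption on my part, not something I have confirmed. A minimal Python sketch for decoding them, where snapshot_time is just a hypothetical helper name:

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUIDs count 100-nanosecond ticks from the start of the
# Gregorian calendar (1582-10-15 00:00 UTC).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def snapshot_time(name):
    """Return the UTC timestamp embedded in a version 1 UUID directory name.

    Hypothetical helper: it assumes the snapshot directory name is a
    time-based UUID, as the names in the listing appear to be.
    """
    u = uuid.UUID(name)
    if u.version != 1:
        raise ValueError("not a time-based UUID: " + name)
    # u.time is the 60-bit tick count; integer-divide by 10 for microseconds.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


if __name__ == "__main__":
    # First and last directory names from the listing in this post.
    for name in (
        "33adb6b0-7bbf-11e4-893d-d96c3e745723",
        "87af7370-7bbf-11e4-893d-d96c3e745723",
    ):
        print(name, snapshot_time(name).isoformat())
```

The decoded times are in UTC, so they need to be shifted to the server's time zone before comparing them with the ls output or the log timestamps.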