When we reboot the problematic node, we see the following errors in system.log.
1. Does this mean hints column family is corrupted?
2. Can we scrub system column family on problematic node and its replication partners?
3. How do we rebuild System keyspace?

==================================================================
ERROR [CompactionExecutor:950] 2015-06-27 20:11:44,595 CassandraDaemon.java (line 191) Exception in thread Thread[CompactionExecutor:950,1,main]
java.lang.AssertionError: originally calculated column size of 8684 but now it is 15725
    at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:135)
    at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
    at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
    at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
    at org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(CompactionManager.java:442)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
ERROR [HintedHandoff:552] 2015-06-27 20:11:44,595 CassandraDaemon.java (line 191) Exception in thread Thread[HintedHandoff:552,1,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 8684 but now it is 15725
    at org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:436)
    at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:282)
    at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:90)
    at org.apache.cassandra.db.HintedHandOffManager$4.run(HintedHandOffManager.java:502)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 8684 but now it is 15725
    at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
    at java.util.concurrent.FutureTask.get(Unknown Source)
    at org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:432)
    ... 6 more
Caused by: java.lang.AssertionError: originally calculated column size of 8684 but now it is 15725
    at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:135)
    at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
    at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
    at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
    at org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(CompactionManager.java:442)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
==================================================================

On Wed, Jul 1, 2015 at 11:59 AM, Shashi Yachavaram <shashi...@gmail.com> wrote:

> We have a 28 node cluster, out of which only one node is experiencing timeouts.
> We thought it was the raid, but there are two other nodes on the same raid without
> any problem. Also The problem goes away if we reboot the node, and then reappears
> after seven days. The following hinted hand-off timeouts are seen on the node
> experiencing the timeouts. Also we did not notice any gossip errors.
>
> I was wondering if anyone has seen this issue and how they resolved it.
>
> Cassandra Version: 1.2.15.1
> OS: Linux cm 2.6.32-504.8.1.el6.x86_64 #1 SMP Fri Dec 19 12:09:25 EST 2014 x86_64 x86_64 x86_64 GNU/Linux
> java version "1.6.0_85"
>
> ------------------------------------------------------------------------------------------------------------------------------------
> INFO [HintedHandoff:2] 2015-06-17 22:52:08,130 HintedHandOffManager.java (line 296) Started hinted handoff for host: 4fe86051-6bca-4c28-b09c-1b0f073c1588 with IP: /192.168.1.122
> INFO [HintedHandoff:1] 2015-06-17 22:52:08,131 HintedHandOffManager.java (line 296) Started hinted handoff for host: bbf0878b-b405-4518-b649-f6cf7c9a6550 with IP: /192.168.1.119
> INFO [HintedHandoff:2] 2015-06-17 22:52:17,634 HintedHandOffManager.java (line 422) Timed out replaying hints to /192.168.1.122; aborting (0 delivered)
> INFO [HintedHandoff:2] 2015-06-17 22:52:17,635 HintedHandOffManager.java (line 296) Started hinted handoff for host: f7b7ab10-4d42-4f0c-af92-2934a075bee3 with IP: /192.168.1.108
> INFO [HintedHandoff:1] 2015-06-17 22:52:17,643 HintedHandOffManager.java (line 422) Timed out replaying hints to /192.168.1.119; aborting (0 delivered)
> INFO [HintedHandoff:1] 2015-06-17 22:52:17,643 HintedHandOffManager.java (line 296) Started hinted handoff for host: ddb79f35-3e2b-4be8-84d8-7942086e2b73 with IP: /192.168.1.104
> INFO [HintedHandoff:2] 2015-06-17 22:52:27,143 HintedHandOffManager.java (line 422) Timed out replaying hints to /192.168.1.108; aborting (0 delivered)
> INFO [HintedHandoff:2] 2015-06-17 22:52:27,144 HintedHandOffManager.java (line 296) Started hinted handoff for host: 6a2fa431-4a51-44cb-af19-1991c960e075 with IP: /192.168.1.117
> INFO [HintedHandoff:1] 2015-06-17 22:52:27,153 HintedHandOffManager.java (line 422) Timed out replaying hints to /192.168.1.104; aborting (0 delivered)
> INFO [HintedHandoff:1] 2015-06-17 22:52:27,154 HintedHandOffManager.java (line 296) Started hinted handoff for host: cf03174a-533c-44d6-a679-e70090ad2bc5 with IP: /192.168.1.107
> ------------------------------------------------------------------------------------------------------------------------------------
>
> Thanks
> -shashi..
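
For anyone wanting to check whether the hint-replay timeouts hit every endpoint or only a few, here is a minimal sketch that tallies the "Timed out replaying hints" lines per target IP. It assumes the unwrapped message format shown in the log excerpt above; the function name count_hint_timeouts and the way the log path is passed are illustrative choices, not part of any Cassandra tooling.

```python
import re
import sys
from collections import Counter

# Pattern for the HintedHandOffManager timeout message as it appears in the
# excerpt above (assumed format; other Cassandra versions may word it differently).
TIMEOUT_RE = re.compile(
    r"Timed out replaying hints to /(\d{1,3}(?:\.\d{1,3}){3}); "
    r"aborting \((\d+)\s+delivered\)"
)


def count_hint_timeouts(log_text):
    """Return a Counter mapping endpoint IP -> number of aborted hint replays."""
    timeouts = Counter()
    for ip, _delivered in TIMEOUT_RE.findall(log_text):
        timeouts[ip] += 1
    return timeouts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python hint_timeouts.py /path/to/system.log
    with open(sys.argv[1]) as log_file:
        for ip, count in count_hint_timeouts(log_file.read()).most_common():
            print(ip, count)
```

If every peer shows up with a similar count, that would point at the sending node (consistent with the failing compaction of the hints on it) rather than at any particular receiver.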