Hello. I'm stress-testing our Cassandra (version 1.0.9) cluster and tried turning off two of the four nodes for half an hour under heavy load. As a result I got a large volume of hints on the surviving nodes - HintsColumnFamily takes about 1.5 GB of disk space on each of them. It seems these hints are never replayed successfully.
After I bring the other nodes back online, tpstats shows active handoffs, but I can't see any writes on the target nodes. The log indicates memory pressure - the heap is >80% full (heap size is 8 GB total, 1 GB young generation). A fragment of the log:

INFO 18:34:05,513 Started hinted handoff for token: 1 with IP: /84.201.162.144
INFO 18:34:06,794 GC for ParNew: 300 ms for 1 collections, 5974181760 used; max is 8588951552
INFO 18:34:07,795 GC for ParNew: 263 ms for 1 collections, 6226018744 used; max is 8588951552
INFO 18:34:08,795 GC for ParNew: 256 ms for 1 collections, 6559918392 used; max is 8588951552
INFO 18:34:09,796 GC for ParNew: 231 ms for 1 collections, 6846133712 used; max is 8588951552
WARN 18:34:09,805 Heap is 0.7978131149667941 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory.
WARN 18:34:09,805 Flushing CFS(Keyspace='test', ColumnFamily='t2') to relieve memory pressure
INFO 18:34:09,806 Enqueuing flush of Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
INFO 18:34:09,807 Writing Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
INFO 18:34:11,018 GC for ParNew: 449 ms for 2 collections, 6573394480 used; max is 8588951552
INFO 18:34:12,019 GC for ParNew: 265 ms for 1 collections, 6820930056 used; max is 8588951552
INFO 18:34:13,112 GC for ParNew: 331 ms for 1 collections, 6900566728 used; max is 8588951552
INFO 18:34:14,181 GC for ParNew: 269 ms for 1 collections, 7101358936 used; max is 8588951552
INFO 18:34:14,691 Completed flushing /mnt/raid/cassandra/data/test/t2-hc-244-Data.db (56156246 bytes)
INFO 18:34:15,381 GC for ParNew: 280 ms for 1 collections, 7268441248 used; max is 8588951552
INFO 18:34:35,306 InetAddress /84.201.162.144 is now dead.
INFO 18:34:35,306 GC for ConcurrentMarkSweep: 19223 ms for 1 collections, 3774714808 used; max is 8588951552
INFO 18:34:35,309 InetAddress /84.201.162.144 is now UP

After taking off the load and restarting the service, I still see pending handoffs:

$ nodetool -h localhost tpstats
Pool Name                    Active   Pending   Completed   Blocked  All time blocked
ReadStage                         0         0     1004257         0                 0
RequestResponseStage              0         0       92555         0                 0
MutationStage                     0         0           6         0                 0
ReadRepairStage                   0         0       57773         0                 0
ReplicateOnWriteStage             0         0           0         0                 0
GossipStage                       0         0      143332         0                 0
AntiEntropyStage                  0         0           0         0                 0
MigrationStage                    0         0           0         0                 0
MemtablePostFlusher               0         0           2         0                 0
StreamStage                       0         0           0         0                 0
FlushWriter                       0         0           2         0                 0
MiscStage                         0         0           0         0                 0
InternalResponseStage             0         0           0         0                 0
HintedHandoff                     1         3          15         0                 0

These 3 handoffs remain pending for a long time (>12 hours).
Most of the time Cassandra uses 100% of one CPU core; the stack trace of the busy thread is:

"HintedHandoff:1" daemon prio=10 tid=0x0000000001220800 nid=0x3843 runnable [0x00007fa1e1146000]
   java.lang.Thread.State: RUNNABLE
        at java.util.ArrayList$Itr.remove(ArrayList.java:808)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedSuper(ColumnFamilyStore.java:908)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedColumnsOnly(ColumnFamilyStore.java:857)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeleted(ColumnFamilyStore.java:850)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1195)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1150)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:324)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:256)
        at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:84)
        at org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:437)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

Heap usage is also rather high, even though the node isn't doing anything except the HH processing. Here is the CMS output:

2013-05-20T22:22:59.812+0400: 4672.075: [GC[YG occupancy: 70070 K (943744 K)]4672.075: [Rescan (parallel) , 0.0224060 secs]4672.098: [weak refs processing, 0.0002900 secs]4672.098: [scrub string table, 0.0002670 secs] [1 CMS-remark: 5523830K(7340032K)] 5593901K(8283776K), 0.0231160 secs] [Times: user=0.28 sys=0.00, real=0.02 secs]

Eventually, after a few service restarts, the hints suddenly disappear. Probably the TTL expires and the hints get compacted away.

My current best guess is the following. Hinted handoffs are stored as supercolumns, with one row per target node. The service tries to read such a row entirely into memory for replay and fails, because the volume is too large to fit in the heap at once. Then the TTL expires, and the service starts deleting the outdated subcolumns during the read. Since the underlying storage is a huge ArrayList, the deletion is inefficient and takes forever (see the sketch below my signature).

So it seems there are two problems here:
1) Hints are not paged correctly and cause significant memory pressure. That is actually strange, since the same issue was supposedly addressed in https://issues.apache.org/jira/browse/CASSANDRA-1327 and https://issues.apache.org/jira/browse/CASSANDRA-3624.
2) Deletion of outdated hints doesn't work well for large hint volumes.

Any suggestions on how to make the cluster more tolerant to downtime? If I turn off hinted handoff entirely and manually run a repair after a downtime, will it restore all the data correctly?

--
Best regards,
Vladimir
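P.S. To show why I suspect the purge is quadratic: the stack trace points at Iterator.remove() on an ArrayList inside removeDeletedSuper(). The class below is not Cassandra code, just my mental model of that pattern under the assumption that almost every subcolumn in the hint row has expired; it compares removing elements one by one while iterating against rebuilding the list in a single pass.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/**
 * Standalone sketch (not Cassandra code): one-by-one removal from an
 * ArrayList shifts the tail of the backing array on every remove(), so
 * dropping most of N elements costs O(N^2), while a single rebuilding
 * pass stays O(N).
 */
public class HintPurgeCost
{
    public static void main(String[] args)
    {
        int n = 200_000; // stand-in for a large number of expired hint subcolumns

        List<Long> columns = fill(n);
        long start = System.nanoTime();
        // Analogue of the purge loop: almost every element is "expired".
        for (Iterator<Long> it = columns.iterator(); it.hasNext(); )
        {
            if (it.next() % 100 != 0)   // keep only 1% of the columns
                it.remove();            // O(n) array copy per removal
        }
        System.out.printf("one-by-one removal: %.2f s, %d columns left%n",
                          (System.nanoTime() - start) / 1e9, columns.size());

        columns = fill(n);
        start = System.nanoTime();
        // The same filtering done by building a new list is a single linear pass.
        List<Long> survivors = new ArrayList<>();
        for (Long c : columns)
            if (c % 100 == 0)
                survivors.add(c);
        System.out.printf("rebuild-and-swap:   %.2f s, %d columns left%n",
                          (System.nanoTime() - start) / 1e9, survivors.size());
    }

    private static List<Long> fill(int n)
    {
        List<Long> list = new ArrayList<>(n);
        for (long i = 0; i < n; i++)
            list.add(i);
        return list;
    }
}

Already at 200,000 elements the first loop is orders of magnitude slower than the second, and its cost grows quadratically with the row size, which would explain the HintedHandoff thread pinning a core for hours on a 1.5 GB hint row.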