We were having occasional memory pressure issues, but we added more RAM to the nodes a few days ago and things are running more smoothly now. In general, though, nodes have not been going up and down.
I tried to do a "list HintsColumnFamily" from cassandra-cli and it locks up my Cassandra node and never returns, forcing me to kill the Cassandra process and restart it to get the node back.

Here are my settings, which I believe are the defaults since I don't remember changing them:

hinted_handoff_enabled: true
max_hint_window_in_ms: 3600000 # one hour
hinted_handoff_throttle_delay_in_ms: 50

Grepping for Hinted in the system log, I get these:

INFO [HintedHandoff:1] 2012-03-13 16:13:22,215 HintedHandOffManager.java (line 373) Finished hinted handoff of 852703 rows to endpoint /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 16:13:34,188 HintedHandOffManager.java (line 284) Endpoint /192.168.20.4 died before hint delivery, aborting
INFO [ScheduledTasks:1] 2012-03-13 16:15:32,569 StatusLogger.java (line 65) HintedHandoff 1 1 0
INFO [HintedHandoff:1] 2012-03-13 16:15:44,362 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 16:21:37,266 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3
INFO [ScheduledTasks:1] 2012-03-13 16:23:07,662 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-13 16:25:49,330 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-13 16:30:52,503 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-13 16:42:22,202 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 HintedHandOffManager.java (line 354) Timed out replaying hints to /192.168.20.3; aborting further deliveries
INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 ColumnFamilyStore.java (line 704) Enqueuing flush of Memtable-HintsColumnFamily@661547256(34298224/74465815 serialized/live bytes, 78808 ops)
INFO [HintedHandoff:1] 2012-03-13 17:11:00,098 HintedHandOffManager.java (line 373) Finished hinted handoff of 44160 rows to endpoint /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 17:11:36,596 HintedHandOffManager.java (line 296) Started hinted handoff for token: 56713727820156407428984779325531226112 with IP: /192.168.20.4
INFO [ScheduledTasks:1] 2012-03-13 17:12:25,248 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [HintedHandoff:1] 2012-03-13 18:47:56,151 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3
INFO [ScheduledTasks:1] 2012-03-13 18:50:24,326 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:12:48,177 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:13:57,685 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:14:57,258 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:14:58,260 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:15:59,093 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:16:59,428 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:18:01,862 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:18:01,898 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:19:04,527 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:19:04,541 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:20:07,712 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:20:08,332 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [HintedHandoff:1] 2012-03-14 12:27:13,033 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3
INFO [ScheduledTasks:1] 2012-03-15 15:05:00,954 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [HintedHandoff:1] 2012-03-15 15:06:07,750 HintedHandOffManager.java (line 354) Timed out replaying hints to /192.168.20.3; aborting further deliveries
INFO [ScheduledTasks:1] 2012-03-15 15:06:07,802 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [HintedHandoff:1] 2012-03-15 15:06:07,809 ColumnFamilyStore.java (line 704) Enqueuing flush of Memtable-HintsColumnFamily@254668880(103911/8312880 serialized/live bytes, 63877 ops)
INFO [ScheduledTasks:1] 2012-03-15 15:07:13,503 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [HintedHandoff:1] 2012-03-15 15:15:43,842 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Thursday, March 15, 2012 1:51 AM
To: user@cassandra.apache.org
Subject: Re: Large hints column family

Is there anything going on in the logs? Are nodes going up and down? Can you see any messages about delivering hints? If the query to read the hints errors, it will log "HintsCF getEPPendingHints timed out" at INFO level.

Also checking, do the hinted_handoff_* settings in cassandra.yaml have their default settings?

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 15/03/2012, at 8:35 AM, Bryce Godfrey wrote:

Forgot to mention that this is on 1.0.8

From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com]
Sent: Wednesday, March 14, 2012 12:34 PM
To: user@cassandra.apache.org
Subject: Large hints column family

The system HintsColumnFamily seems large in my cluster, and I want to track down why that is. I tried invoking "listEndpointsPendingHints()" for o.a.c.db.HintedHandoffManager and it never returns, and it also freezes the node that it's invoked against.
It's a 3 node cluster, and all nodes have been up and running without issue for a while. Any help on where to start with this?

Column Family: HintsColumnFamily
SSTable count: 11
Space used (live): 11271669539
Space used (total): 11271669539
Number of Keys (estimate): 1408
Memtable Columns Count: 338
Memtable Data Size: 0
Memtable Switch Count: 1
Read Count: 3
Read Latency: 4354.669 ms.
Write Count: 848
Write Latency: 0.029 ms.
Pending Tasks: 0
Bloom Filter False Postives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 12656
Key cache capacity: 14
Key cache size: 11
Key cache hit rate: 0.6666666666666666
Row cache: disabled
Compacted row minimum size: 105779
Compacted row maximum size: 7152383774
Compacted row mean size: 590818614

Thanks,
Bryce
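
For anyone trying to read a long grep of hint messages like the one in this thread, here is a rough sketch of tallying the events per endpoint. It assumes the log has one message per line and that the message wording matches the 1.0.8 lines quoted here; other versions may word them differently. The names `PATTERNS`, `summarize_hints` and `sample_lines` are made up for illustration and are not part of Cassandra.

```python
import re
from collections import Counter

# Message formats copied from the HintedHandOffManager lines quoted in this
# thread (Cassandra 1.0.8). Assumption: other versions may differ.
PATTERNS = {
    "started": re.compile(r"Started hinted handoff for token: \d+ with IP: /([\d.]+)"),
    "finished": re.compile(r"Finished hinted handoff of (\d+) rows to endpoint /([\d.]+)"),
    "timed_out": re.compile(r"Timed out replaying hints to /([\d.]+)"),
    "died": re.compile(r"Endpoint /([\d.]+) died before hint delivery"),
}


def summarize_hints(lines):
    """Count hint events per (endpoint, event) and sum delivered rows per endpoint."""
    events = Counter()
    rows = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if not match:
                continue
            if name == "finished":
                # Group 1 is the row count, group 2 is the endpoint address.
                endpoint = match.group(2)
                rows[endpoint] += int(match.group(1))
            else:
                endpoint = match.group(1)
            events[(endpoint, name)] += 1
            break  # at most one event is counted per log line
    return events, rows


# A few lines taken from the log excerpt in this thread.
sample_lines = [
    "INFO [HintedHandoff:1] 2012-03-13 16:13:22,215 HintedHandOffManager.java (line 373) Finished hinted handoff of 852703 rows to endpoint /192.168.20.3",
    "INFO [HintedHandoff:1] 2012-03-13 16:13:34,188 HintedHandOffManager.java (line 284) Endpoint /192.168.20.4 died before hint delivery, aborting",
    "INFO [HintedHandoff:1] 2012-03-13 16:15:44,362 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3",
    "INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 HintedHandOffManager.java (line 354) Timed out replaying hints to /192.168.20.3; aborting further deliveries",
    "INFO [HintedHandoff:1] 2012-03-13 17:11:00,098 HintedHandOffManager.java (line 373) Finished hinted handoff of 44160 rows to endpoint /192.168.20.3",
]

events, rows = summarize_hints(sample_lines)
for (endpoint, name), count in sorted(events.items()):
    print(endpoint, name, count)
for endpoint, total in sorted(rows.items()):
    print(endpoint, "rows delivered:", total)
```

To run it against a real log, pass the open file object instead of `sample_lines`. An endpoint with many "started" entries and few "finished" ones would be the first place to look.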