Hello,

We're running a six-node 0.7.4 ring in EC2 on m1.xlarge instances with a 4GB heap (15GB total memory, 4 cores, dataset fits in RAM, storage on ephemeral disk). We've noticed a brief flurry of query failures each night coinciding with our backup schedule. More specifically, our logs suggest that calling "nodetool snapshot" on a node triggers 12-16 second CMS collections and a promotion failure, resulting in a full stop-the-world collection, during which the node is marked dead by the ring until it rejoins shortly after.
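For context, our backup cron job amounts to the following (a simplified sketch; the localhost target and unnamed snapshot are approximations of the real script):

    #!/bin/sh
    # Nightly Cassandra backup, run from cron at 13:15.
    # Flush memtables to SSTables on disk, then hard-link the SSTables
    # into a snapshot directory.
    nodetool -h localhost flush
    nodetool -h localhost snapshot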
Here's a log from one of the nodes, along with system info and JVM options: https://gist.github.com/e12c6cae500e118676d1

At 13:15:00, our backup cron job runs, calling nodetool flush followed by nodetool snapshot. (After investigating, we noticed that calling both flush and snapshot is unnecessary, since snapshot performs its own flush, and we have since updated the script to call only snapshot.)

While writing memtables, we'll generally see a GC logged by Cassandra such as:

    GC for ConcurrentMarkSweep: 16113 ms, 1755422432 reclaimed leaving 1869123536 used; max is 4424663040

In the JVM GC logs, we'll often see a tenured promotion failure occurring during this collection, resulting in a full stop-the-world GC like this (from a different node):

    1180629.380: [CMS1180634.414: [CMS-concurrent-mark: 6.041/6.468 secs] [Times: user=8.00 sys=0.10, real=6.46 secs] (concurrent mode failure): 3904635K->1700629K(4109120K), 16.0548910 secs] 3958389K->1700629K(4185792K), [CMS Perm : 19610K->19601K(32796K)], 16.1057040 secs] [Times: user=14.39 sys=0.02, real=16.10 secs]

During the GC, the rest of the ring shuns the node; when the collection completes, the node marks all other hosts in the ring as dead. The node and ring stabilize shortly after, once they detect each other as up and complete hinted handoff (details in the log).

Yesterday we enabled JNA on one of the nodes so that snapshots can create hard links without forking a subprocess to call `ln`. We still observed a concurrent mode failure following a flush/snapshot, but the CMS collection was shorter (9 seconds) and the node was not shunned from the ring.

While the query failures that result from this activity are brief, our retry threshold is set to 6 for timeout exceptions. We're concerned that we're exceeding it, and we'd like to understand why we see long CMS collections and promotion failures triggering full GCs during a snapshot.

Has anyone seen this, or have suggestions on how to prevent full GCs from occurring during a flush/snapshot?

Thanks,

- Scott

---

C. Scott Andreas
Engineer, Urban Airship, Inc.
http://www.urbanairship.com