This kind of information is very helpful. Thank you for sharing your experience.
maki

2011/7/27 Teijo Holzer <thol...@wetafx.co.nz>:
> Hi,
>
> I thought I'd share the following with this mailing list, as a number of
> other users seem to have had similar problems.
>
> We have the following set-up:
>
> OS: CentOS 5.5
> RAM: 16GB
> JVM heap size: 8GB (also tested with 14GB)
> Cassandra version: 0.7.6-2 (also tested with 0.7.7)
> Oracle JDK version: 1.6.0_26
> Number of nodes: 5
> Load per node: ~40GB
> Replication factor: 3
> Number of requests/day: 2.5 million (95% inserts)
> Total net insert data/day: 1GB
> Default TTL for most of the data: 10 days
>
> This set-up had been operating successfully for a few months, but recently
> we started seeing multi-node failures, usually triggered by a repair, but
> occasionally also under normal operation. A repair on nodes 3, 4 and 5 would
> always cause the cluster as a whole to fail, whereas nodes 1 & 2 completed
> their repair cycles successfully.
>
> These failures would usually result in 2 or 3 nodes becoming unresponsive
> and dropping out of the cluster, causing the client failure rate to spike
> to ~10%. We normally operate with a failure rate of <0.1%.
>
> The relevant log entries showed complete heap memory exhaustion within 1
> minute (see the log lines below, where we experimented with a larger heap
> size of 14GB). Also of interest were a number of huge SliceQueryFilter
> collections running concurrently on the nodes in question (see the log
> lines below).
>
> The way we ended up recovering from this situation was as follows. Remember
> that these steps were taken to get an unstable cluster back under control,
> so you may want to revert some of the changes once the cluster is stable
> again.
>
> Set "disk_access_mode: standard" in cassandra.yaml.
> This allowed us to prevent the JVM from blowing past the hard limit of 8GB
> via large mmaps. The heap size was set to 8GB (RAM/2), which meant the JVM
> never used more than 8GB in total. mlockall didn't seem to make a difference
> for our particular problem.
>
> Turn off all row & key caches via cassandra-cli, e.g.
> update column family Example with rows_cached=0;
> update column family Example with keys_cached=0;
> We were seeing compacted row maximum sizes of ~800MB in cfstats, which is
> why we turned them all off. Again, we saw a significant drop in the actual
> memory used out of the available maximum of 8GB. Obviously this will affect
> reads, but as 95% of our requests are inserts, it didn't matter much for us.
>
> Bootstrap the problematic node:
> Kill Cassandra
> Set "auto_bootstrap: true" in cassandra.yaml, remove the node's own IP
> address from the list of seeds (important)
> Delete all data directories (i.e. commit-log, data, saved-caches)
> Start Cassandra
> Wait for the bootstrap to finish (see log & nodetool)
> Set "auto_bootstrap: false"
> (Run repair)
>
> The first bootstrap completed very quickly, so we decided to bootstrap every
> node in the cluster (not just the problematic ones). This resulted in some
> data loss. Next time we will follow each bootstrap with a repair before
> bootstrapping & repairing the next node, to minimize data loss.
>
> After this procedure, the cluster was operating normally again.
>
> We now run a continuous rolling repair, followed by a (major) compaction and
> a manual garbage collection. As the repairs are required anyway, we decided
> to run them all the time in a continuous fashion, so that potential problems
> are identified earlier.
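For reference, the recovery steps quoted above condense to roughly the
following on a 0.7.x node. The column family name "Example" and the data
directory paths are only placeholders (the paths shown are the usual
defaults), so check them against your own schema and cassandra.yaml before
running anything:

  # cassandra.yaml: cap memory use, then allow the node to re-bootstrap
  disk_access_mode: standard
  auto_bootstrap: true    # set back to false once the node has rejoined
  # also remove this node's own IP from the "seeds" list before bootstrapping

  # cassandra-cli: turn off row & key caches per column family
  update column family Example with rows_cached=0;
  update column family Example with keys_cached=0;

  # shell: wipe the node and let it bootstrap (this destroys its local data)
  kill <cassandra pid>
  rm -rf /var/lib/cassandra/commitlog /var/lib/cassandra/data /var/lib/cassandra/saved_caches
  <start cassandra>
  nodetool -h <node> ring    # watch until the node shows up as Normal again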
>
> The major compaction followed by a manual GC allows us to keep the disk
> usage low on each node. The manual GC is necessary because the unused files
> on disk are only really deleted once their references have been garbage
> collected inside the JVM (a restart would achieve the same).
>
> We also collected some statistics regarding the duration of these
> operations:
>
> cleanup/compact: ~1 min/GB
> repair: ~2-3 min/GB
> bootstrap: ~1 min/GB
>
> This means that if you have a node with 60GB of data, it will take ~1hr to
> compact and ~2-3hrs to repair. It is therefore advisable to keep the data
> per node below ~120GB. We achieve this by using an aggressive TTL on most
> of our writes.
>
> Cheers,
>
> Teijo
>
> Here are the relevant log entries showing the OOM conditions:
>
> [2011-07-21 11:12:11,059] INFO: GC for ParNew: 1141 ms, 509843976 reclaimed
> leaving 1469443752 used; max is 14675869696 (ScheduledTasks:1
> GCInspector.java:128)
> [2011-07-21 11:12:15,226] INFO: GC for ParNew: 1149 ms, 564409392 reclaimed
> leaving 2247228920 used; max is 14675869696 (ScheduledTasks:1
> GCInspector.java:128)
> ...
> [2011-07-21 11:12:55,062] INFO: GC for ParNew: 1110 ms, 564365792 reclaimed
> leaving 12901974704 used; max is 14675869696 (ScheduledTasks:1
> GCInspector.java:128)
>
> [2011-07-21 10:57:23,548] DEBUG: collecting 4354206 of 2147483647:
> 940657e5b3b0d759eb4a14a7228ae365:false:41@1311102443362542 (ReadStage:27
> SliceQueryFilter.java:123)
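A rough sketch of what the continuous maintenance cycle described above could
look like per node (run against one node at a time, not in parallel across
the cluster); the JMX step is an assumption about how the manual GC is
triggered, since the original post doesn't say:

  nodetool -h <node> repair     # rolling repair
  nodetool -h <node> compact    # major compaction
  # force a full GC so obsolete SSTable files are actually unlinked, e.g. by
  # invoking the gc() operation on the java.lang:type=Memory MBean through
  # jconsole or another JMX client; a rolling restart achieves the same.

Going by the quoted figures (~1 min/GB to compact, ~2-3 min/GB to repair),
one full pass over a 40GB node works out to roughly two to three hours,
which is what makes running the cycle continuously feasible.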