This kind of information is very helpful. Thank you for sharing your experience.
maki

2011/7/27 Teijo Holzer <thol...@wetafx.co.nz>:
> Hi,
>
> I thought I'd share the following with this mailing list, as a number of
> other users seem to have had similar problems.
>
> We have the following set-up:
>
> OS: CentOS 5.5
> RAM: 16GB
> JVM heap size: 8GB (also tested with 14GB)
> Cassandra version: 0.7.6-2 (also tested with 0.7.7)
> Oracle JDK version: 1.6.0_26
> Number of nodes: 5
> Load per node: ~40GB
> Replication factor: 3
> Number of requests/day: 2.5 million (95% inserts)
> Total net insert data/day: 1GB
> Default TTL for most of the data: 10 days
>
> This set-up had been operating successfully for a few months, but recently
> we started seeing multi-node failures, usually triggered by a repair, but
> occasionally also under normal operation. A repair on nodes 3, 4 and 5 would
> always cause the cluster as a whole to fail, whereas nodes 1 & 2 completed
> their repair cycles successfully.
>
> These failures would usually result in 2 or 3 nodes becoming unresponsive
> and dropping out of the cluster, causing the client failure rate to spike
> to ~10%. We normally operate with a failure rate of <0.1%.
>
> The relevant log entries showed complete heap memory exhaustion within 1
> minute (see the log lines below, where we experimented with a larger heap
> size of 14GB). Also of interest were a number of huge SliceQueryFilter
> collections running concurrently on the nodes in question (see the log
> lines below).
>
> The way we ended up recovering from this situation was as follows. Remember
> that these steps were taken to get an unstable cluster back under control,
> so you may want to revert some of the changes once the cluster is stable
> again.
>
> Set "disk_access_mode: standard" in cassandra.yaml.
> This allowed us to prevent the JVM from blowing past the hard limit of 8GB
> via large mmaps. The heap size was set to 8GB (RAM/2), which meant the JVM
> never used more than 8GB in total. mlockall didn't seem to make a difference
> for our particular problem.
>
> Turn off all row & key caches via cassandra-cli, e.g.
> update column family Example with rows_cached=0;
> update column family Example with keys_cached=0;
> We were seeing compacted row maximum sizes of ~800MB in cfstats, which is
> why we turned them all off. Again, we saw a significant drop in the actual
> memory used out of the available maximum of 8GB. Obviously this will affect
> reads, but as 95% of our requests are inserts, it didn't matter much for us.
>
> Bootstrap the problematic node:
> Kill Cassandra
> Set "auto_bootstrap: true" in cassandra.yaml, remove the node's own IP
> address from the list of seeds (important)
> Delete all data directories (i.e. commit-log, data, saved-caches)
> Start Cassandra
> Wait for the bootstrap to finish (see log & nodetool)
> Set "auto_bootstrap: false"
> (Run repair)
>
> The first bootstrap completed very quickly, so we decided to bootstrap every
> node in the cluster (not just the problematic ones). This resulted in some
> data loss. Next time we will follow each bootstrap with a repair before
> bootstrapping & repairing the next node, to minimize data loss.
>
> After this procedure, the cluster was operating normally again.
>
> We now run a continuous rolling repair, followed by a (major) compaction and
> a manual garbage collection. As the repairs are required anyway, we decided
> to run them all the time in a continuous fashion, so that potential problems
> are identified earlier.
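For reference, the recovery steps quoted above condense to roughly the
following on a 0.7.x node. The column family name "Example" and the data
directory paths are only placeholders (the paths shown are the usual
defaults), so check them against your own schema and cassandra.yaml before
running anything:

  # cassandra.yaml: cap memory use, then allow the node to re-bootstrap
  disk_access_mode: standard
  auto_bootstrap: true    # set back to false once the node has rejoined
  # also remove this node's own IP from the "seeds" list before bootstrapping

  # cassandra-cli: turn off row & key caches per column family
  update column family Example with rows_cached=0;
  update column family Example with keys_cached=0;

  # shell: wipe the node and let it bootstrap (this destroys its local data)
  kill <cassandra pid>
  rm -rf /var/lib/cassandra/commitlog /var/lib/cassandra/data /var/lib/cassandra/saved_caches
  <start cassandra>
  nodetool -h <node> ring    # watch until the node shows up as Normal again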
>
> The major compaction followed by a manual GC allows us to keep the disk
> usage low on each node. The manual GC is necessary because the unused files
> on disk are only really deleted once their references have been garbage
> collected inside the JVM (a restart would achieve the same).
>
> We also collected some statistics regarding the duration of these
> operations:
>
> cleanup/compact: ~1 min/GB
> repair: ~2-3 min/GB
> bootstrap: ~1 min/GB
>
> This means that if you have a node with 60GB of data, it will take ~1hr to
> compact and ~2-3hrs to repair. It is therefore advisable to keep the data
> per node below ~120GB. We achieve this by using an aggressive TTL on most
> of our writes.
>
> Cheers,
>
> Teijo
>
> Here are the relevant log entries showing the OOM conditions:
>
> [2011-07-21 11:12:11,059] INFO: GC for ParNew: 1141 ms, 509843976 reclaimed
> leaving 1469443752 used; max is 14675869696 (ScheduledTasks:1
> GCInspector.java:128)
> [2011-07-21 11:12:15,226] INFO: GC for ParNew: 1149 ms, 564409392 reclaimed
> leaving 2247228920 used; max is 14675869696 (ScheduledTasks:1
> GCInspector.java:128)
> ...
> [2011-07-21 11:12:55,062] INFO: GC for ParNew: 1110 ms, 564365792 reclaimed
> leaving 12901974704 used; max is 14675869696 (ScheduledTasks:1
> GCInspector.java:128)
>
> [2011-07-21 10:57:23,548] DEBUG: collecting 4354206 of 2147483647:
> 940657e5b3b0d759eb4a14a7228ae365:false:41@1311102443362542 (ReadStage:27
> SliceQueryFilter.java:123)
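A rough sketch of what the continuous maintenance cycle described above could
look like per node (run against one node at a time, not in parallel across
the cluster); the JMX step is an assumption about how the manual GC is
triggered, since the original post doesn't say:

  nodetool -h <node> repair     # rolling repair
  nodetool -h <node> compact    # major compaction
  # force a full GC so obsolete SSTable files are actually unlinked, e.g. by
  # invoking the gc() operation on the java.lang:type=Memory MBean through
  # jconsole or another JMX client; a rolling restart achieves the same.

Going by the quoted figures (~1 min/GB to compact, ~2-3 min/GB to repair),
one full pass over a 40GB node works out to roughly two to three hours,
which is what makes running the cycle continuously feasible.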