In fact, I truncated the hints table to stabilize the cluster. From the heap dumps I was able to identify the table against which numerous queries were running. I then focused on the system_traces.sessions table around the time the OOM occurred. It turned out that a full table scan on a large table was what caused the OOM.
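For reference, this is roughly what I ran (we are on a pre-3.0 cluster, so hints live in the system.hints table and traces in system_traces.sessions; the exact column list is from memory, so adjust for your version):

$ nodetool truncatehints        # drop the hint backlog (only if your consistency levels make that safe)
$ cqlsh
cqlsh> SELECT session_id, started_at, coordinator, request, duration
   ... FROM system_traces.sessions LIMIT 50;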
Thanks, every one of you.

________________________________
From: Jeff Jirsa <jji...@apache.org>
Sent: Tuesday, March 7, 2017 1:19 PM
To: user@cassandra.apache.org
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time

On 2017-03-03 09:18 (-0800), Shravan Ch <chall...@outlook.com> wrote:
>
> nodetool compactionstats -H
> pending tasks: 3
>    compaction type   keyspace   table   completed      total   unit   progress
>         Compaction     system   hints     28.5 GB   92.38 GB   bytes     30.85%
>

The hint buildup is also something that could have caused OOMs. Hints are stored for a given host in a single partition, which means it's common for a single row/partition to get huge if you have a single host flapping. If you see "Compacting large row" messages for the hint rows, I suspect you'll find that one of the hosts/rows is responsible for most of that 92GB of hints, which means when you try to deliver the hints, you'll read from a huge partition, which creates memory pressure (see: CASSANDRA-9754), leading to GC pauses (or OOMs), which then causes you to flap, which causes you to create more hints, which causes an ugly spiral.

In 3.0, hints were rewritten to avoid this problem, but short term you may need to truncate your hints to get healthy (assuming it's safe for you to do so, where 'safe' is based on your read+write consistency levels).
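For anyone finding this thread in the archives later: before truncating, you can sanity-check the single-huge-partition theory with nodetool. Something along these lines (assuming pre-3.0, where hints are an ordinary table in the system keyspace):

$ nodetool cfstats system.hints

Then look at the maximum compacted partition size it reports; if one flapping host is to blame, that single partition should account for most of the hint data. Truncating is then either nodetool truncatehints or TRUNCATE system.hints from cqlsh.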