@Vladimir
We tried with 12 GB and 16 GB; the problem eventually appeared with those
settings too. In this particular cluster we have 143 tables across 2
keyspaces.

@Alexander
We have one table with a max partition of 2.68 GB, one of 256 MB, a bunch
whose max varies roughly between 10 MB and 100 MB, and the rest with a max
below 10 MB.

On the biggest table, the 99th percentile is around 60 MB, the 98th around
25 MB, and the 95th around 5.5 MB. On the table with the 256 MB max, the
99th percentile is around 4.6 MB and the 98th around 2 MB. Could the top 1%
here really have that much impact?

We do write a lot to the biggest table and read from it quite often too,
but I have no way to know whether that big partition is ever read.

On Mon, Nov 21, 2016, at 01:09 PM, Alexander Dejanovski wrote:
> Hi Vincent,
>
> one of the usual causes of OOMs is very large partitions.
> Could you check your nodetool cfstats output in search of large
> partitions? If you find one (or more), run nodetool cfhistograms on
> those tables to get a view of the partition size distribution.
>
> Thanks
>
> On Mon, Nov 21, 2016 at 12:01 PM Vladimir Yudovin
> <vla...@winguzone.com> wrote:
>> Did you try any value in the range 8-20 GB (e.g. 60-70% of physical
>> memory)?
>> Also, how many tables do you have across all keyspaces? Each table
>> can consume a minimum of 1 MB of Java heap.
>>
>> Best regards, Vladimir Yudovin,
>> Winguzone [1] - Hosted Cloud Cassandra. Launch your cluster in
>> minutes.
>>
>> ---- On Mon, 21 Nov 2016 05:13:12 -0500 Vincent Rischmann
>> <m...@vrischmann.me> wrote ----
>>
>>> Hello,
>>>
>>> we have an 8-node Cassandra 2.1.15 cluster at work which has been
>>> giving us a lot of trouble lately.
>>>
>>> The problem is simple: nodes regularly die because of an out of
>>> memory exception, or because the Linux OOM killer decides to kill
>>> the process.
>>> A couple of weeks ago we increased the heap to 20 GB hoping it
>>> would solve the out of memory errors, but it didn't; instead of
>>> getting out of memory exceptions, the OOM killer killed the JVM.
>>>
>>> We reduced the heap on some nodes to 8 GB to see if it would work
>>> better, but some nodes crashed again with out of memory exceptions.
>>>
>>> I suspect some of our tables are badly modelled, which would cause
>>> Cassandra to allocate a lot of memory, but I don't know how to
>>> prove that and/or find which table is bad and which query is
>>> responsible.
>>>
>>> I tried looking at metrics in JMX, and tried profiling with Mission
>>> Control, but it didn't really help; it's possible I missed
>>> something because I have no idea what to look for exactly.
>>>
>>> Does anyone have some advice for troubleshooting this?
>>>
>>> Thanks.
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com [2]

Links:
1. https://winguzone.com?from=list
2. http://www.thelastpickle.com/
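
(For reference, partition-size figures like the ones discussed above can be
pulled with the standard nodetool commands Alexander mentions; a minimal
sketch, where "my_keyspace" and "my_table" are placeholder names, not the
actual tables in this thread:

    # per-table summary, including "Compacted partition maximum bytes"
    # and "Compacted partition mean bytes"
    nodetool cfstats my_keyspace

    # percentile breakdown (50/75/95/98/99/Min/Max) of partition size
    # and cell count for a single table
    nodetool cfhistograms my_keyspace my_table

The heap sizes being compared are set in cassandra-env.sh, e.g.
MAX_HEAP_SIZE="8G" together with a matching HEAP_NEWSIZE.)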