Thanks for your answer, Alexander.
We're writing constantly to the table; we estimate it's something like 1.5k to 2k writes per second. Some of these requests update a bunch of fields, some update fields and append something to a set. We don't read from it constantly, but when we do it's a lot of reads, up to 20k reads per second sometimes. For this particular keyspace everything is using the size-tiered compaction strategy.

- Every node is a physical server with an 8-core CPU, 32GB of RAM and 3TB of SSD.

- The Java version is 1.8.0_101 on all nodes except one, which is using 1.8.0_111 (only for about a week I think, before that it used 1.8.0_101 too).

- We're using the G1 GC. I looked at the 19th and on that day we had:
  - 1505 GCs in total
  - 2 old gen GCs which took around 5s each
  - the rest were new gen GCs, with only one other taking about 1s. There were 15 to 20 GCs which took between 0.4s and 0.7s; the rest took roughly 250ms to 400ms. Sometimes there are 3, 4 or 5 GCs in a row within about 2 seconds, each taking 250ms to 400ms, but that's fairly rare from what I can see.

- Regarding GC logs, I have them enabled and I've got a bunch of gc.log.X files in /var/log/cassandra, but somehow I can't find any log files for certain periods. On one node which crashed this morning I lost about a week of GC logs, no idea what is happening there...

- I'll just put a couple of warnings here; there are around 9k just for today:

WARN [SharedPool-Worker-8] 2016-11-21 17:02:00,497 SliceQueryFilter.java:320 - Read 2001 live and 11129 tombstone cells in foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000 columns were requested, slices=[-]
WARN [SharedPool-Worker-1] 2016-11-21 17:02:02,559 SliceQueryFilter.java:320 - Read 2001 live and 11064 tombstone cells in foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000 columns were requested, slices=[di[42FB29E1-8C99-45BE-8A44-9480C50C6BC4]:!-]
WARN [SharedPool-Worker-2] 2016-11-21 17:02:05,286 SliceQueryFilter.java:320 - Read 2001 live and 11064 tombstone cells in foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000 columns were requested, slices=[di[42FB29E1-8C99-45BE-8A44-9480C50C6BC4]:!-]
WARN [SharedPool-Worker-11] 2016-11-21 17:02:08,860 SliceQueryFilter.java:320 - Read 2001 live and 19966 tombstone cells in foo.install_info for key: foo@IOS:10 (see tombstone_warn_threshold). 2000 columns were requested, slices=[-]

  We're guessing this is bad since it's warning us, but does it have a significant impact on the heap / GC? I don't really know.

- cfstats tells me this:

  Average live cells per slice (last five minutes): 1458.029594846951
  Maximum live cells per slice (last five minutes): 2001.0
  Average tombstones per slice (last five minutes): 1108.2466913854232
  Maximum tombstones per slice (last five minutes): 22602.0

- Regarding swap, it's not disabled anywhere; I must say we never really thought about it. Does disabling it provide a significant benefit?

Thanks for your help, really appreciated!
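To try to quantify the tombstone issue on my side, I hacked together the quick sketch below. It only uses the system.log warnings, cfstats and sstablemetadata; the keyspace/table name (foo.install_info) comes from the warnings above, but the data directory is just the default /var/lib/cassandra/data, so adjust if yours differs. The idea is to see which partition keys trigger the most tombstone warnings and how many droppable tombstones each SSTable of that table carries. Does that look like a sane way to check?

    # Which partition keys trigger the most tombstone warnings today?
    grep 'tombstone cells' /var/log/cassandra/system.log \
      | grep "$(date +%F)" \
      | sed -e 's/.*for key: //' -e 's/ (see tombstone_warn_threshold.*//' \
      | sort | uniq -c | sort -rn | head

    # Table-level tombstone counters (same numbers cfstats reports above)
    nodetool cfstats foo.install_info | grep -i tombstone

    # Droppable tombstone ratio per SSTable of that table
    # (assumes the default data directory and that sstablemetadata is on the PATH)
    for f in /var/lib/cassandra/data/foo/install_info-*/*-Data.db; do
      echo "== $f"
      sstablemetadata "$f" | grep -i 'droppable tombstones'
    done

If the counts line up with the keys from the warnings (foo@IOS:7, foo@IOS:10), at least we'd know which rows / access pattern generate them.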
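On the swap question: from what I've read, the usual recommendation is to run Cassandra nodes with swap disabled entirely, since letting the kernel page out parts of the JVM heap tends to produce exactly the long pauses and OOM-killer kills we're seeing rather than any benefit. This is roughly what I'd plan to run on each node (just a sketch; the sysctl.d file name is arbitrary). Does that look right to you?

    # Turn swap off right now
    sudo swapoff -a

    # Keep it off across reboots: comment out swap entries in /etc/fstab
    sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab

    # And tell the kernel not to favour swapping if it ever comes back
    echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/99-cassandra.conf
    sudo sysctl -p /etc/sysctl.d/99-cassandra.conf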
On Mon, Nov 21, 2016, at 04:13 PM, Alexander Dejanovski wrote:
> Vincent,
>
> only the 2.68GB partition is out of bounds here, all the others (<256MB) shouldn't be much of a problem.
> It could put pressure on your heap if it is often read and/or compacted.
> But to answer your question about the 1% harming the cluster, a few big partitions can definitely be a big problem depending on your access patterns.
> Which compaction strategy are you using on this table ?
>
> Could you provide/check the following things on a node that crashed recently :
> * Hardware specifications (how many cores ? how much RAM ? Bare metal or VMs ?)
> * Java version
> * GC pauses throughout a day (grep GCInspector /var/log/cassandra/system.log) : check if you have many pauses that take more than 1 second
> * GC logs at the time of a crash (if you don't produce any, you should activate them in cassandra-env.sh)
> * Tombstone warnings in the logs and a high number of tombstones read in cfstats
> * Make sure swap is disabled
>
> Cheers,
>
> On Mon, Nov 21, 2016 at 2:57 PM Vincent Rischmann <m...@vrischmann.me> wrote:
>> @Vladimir
>>
>> We tried with 12Gb and 16Gb, the problem appeared eventually too.
>> In this particular cluster we have 143 tables across 2 keyspaces.
>>
>> @Alexander
>>
>> We have one table with a max partition of 2.68GB, one of 256MB, a bunch with sizes varying between 10MB and 100MB ~. Then there's the rest with the max lower than 10MB.
>>
>> On the biggest, the 99% is around 60MB, 98% around 25MB, 95% around 5.5MB.
>> On the one with a max of 256MB, the 99% is around 4.6MB, 98% around 2MB.
>>
>> Could the 1% here really have that much impact ? We do write a lot to the biggest table and read quite often too, however I have no way to know if that big partition is ever read.
>>
>> On Mon, Nov 21, 2016, at 01:09 PM, Alexander Dejanovski wrote:
>>> Hi Vincent,
>>>
>>> one of the usual causes of OOMs is very large partitions.
>>> Could you check your nodetool cfstats output in search of large partitions ? If you find one (or more), run nodetool cfhistograms on those tables to get a view of the partition sizes distribution.
>>>
>>> Thanks
>>>
>>> On Mon, Nov 21, 2016 at 12:01 PM Vladimir Yudovin <vla...@winguzone.com> wrote:
>>>> Did you try any value in the range 8-20 (e.g. 60-70% of physical memory)?
>>>> Also how many tables do you have across all keyspaces? Each table can consume a minimum of 1M of Java heap.
>>>>
>>>> Best regards, Vladimir Yudovin,
>>>> *Winguzone[1] - Hosted Cloud Cassandra. Launch your cluster in minutes.*
>>>>
>>>> ---- On Mon, 21 Nov 2016 05:13:12 -0500 *Vincent Rischmann <m...@vrischmann.me>* wrote ----
>>>>
>>>>> Hello,
>>>>>
>>>>> we have an 8 node Cassandra 2.1.15 cluster at work which is giving us a lot of trouble lately.
>>>>>
>>>>> The problem is simple: nodes regularly die because of an out of memory exception, or the Linux OOM killer decides to kill the process.
>>>>> For a couple of weeks now we have increased the heap to 20Gb hoping it would solve the out of memory errors, but in fact it didn't; instead of getting out of memory exceptions, the OOM killer killed the JVM.
>>>>>
>>>>> We reduced the heap on some nodes to 8Gb to see if it would work better, but some nodes crashed again with out of memory exceptions.
>>>>>
>>>>> I suspect some of our tables are badly modelled, which would cause Cassandra to allocate a lot of data, however I don't know how to prove that and/or find which table is bad, and which query is responsible.
>>>>>
>>>>> I tried looking at metrics in JMX, and tried profiling using mission control, but it didn't really help; it's possible I missed it because I have no idea what to look for exactly.
>>>>>
>>>>> Anyone have some advice for troubleshooting this ?
>>>>>
>>>>> Thanks.
>>> --
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com[2]
>>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com[3]

Links:
1. https://winguzone.com?from=list
2. http://www.thelastpickle.com/
3. http://www.thelastpickle.com/