Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

2019-11-06 Thread Reid Pinchback
Just food for thought. Elevated read requests won’t result in escalating pending compactions, except in the corner case where the reads trigger additional write work, like for a repair or lurking tombstones deemed droppable. For a sustained growth in pending compactions, that’s not looking lik

Re: Aws instance stop and star with ebs

2019-11-06 Thread Rahul Reddy
Thanks Daemeon, will do that and post the results. I found a JIRA ticket in open state with a similar issue: https://issues.apache.org/jira/browse/CASSANDRA-13984 On Wed, Nov 6, 2019 at 1:49 PM daemeon reiydelle wrote: > No connection timeouts? No tcp level retries? I am sorry truly sorry but > you have e

RE: Cassandra 3.0.18 went OOM several hours after joining a cluster

2019-11-06 Thread Steinmaurer, Thomas
Reid, thanks for your thoughts. I agree with your last comment and I’m pretty sure that the increasing number of SSTables is causing the issue, although I’m not sure whether it is compaction, read requests (after the node flipped from UJ to UN), or both; I tend more towards client read requests

Re: Aws instance stop and star with ebs

2019-11-06 Thread daemeon reiydelle
No connection timeouts? No TCP-level retries? I am sorry, truly sorry, but you have exceeded my capability. I have never seen a java.io timeout without either a session half-open failure (no response) or multiple retries. I am out of my depth, so please feel free to ignore, but did you see the pack

Re: Aws instance stop and star with ebs

2019-11-06 Thread Rahul Reddy
And this is on the node which was not stopped; it was active and didn't have issues with TCP before. Only after the east node was stopped and started did it start seeing errors. Please let me know if anything else needs to be checked. On Wed, Nov 6, 2019, 12:18 PM Reid Pinchback wrote: > Almost 15 minutes, that

Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

2019-11-06 Thread Reid Pinchback
The other thing that comes to mind is that the increase in pending compactions suggests back pressure on compaction activity. GC is only one possible source of that. Between your throughput setting and how your disk I/O is set up, maybe that’s throttling you to a rate where the rate of added r

Re: Aws instance stop and star with ebs

2019-11-06 Thread Reid Pinchback
Almost 15 minutes, that sounds suspiciously like blocking on a default TCP socket timeout. From: Rahul Reddy Reply-To: "user@cassandra.apache.org" Date: Wednesday, November 6, 2019 at 12:12 PM To: "user@cassandra.apache.org" Subject: Re: Aws instance stop and star with ebs Message from Extern

Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

2019-11-06 Thread Reid Pinchback
My first thought was that you were running into the merkle tree depth problem, but the details on the ticket don’t seem to confirm that. It does look like eden is too small. C* lives in Java’s GC pain point, a lot of medium-lifetime objects. If you haven’t already done so, you’ll want to con

Re: Aws instance stop and star with ebs

2019-11-06 Thread Rahul Reddy
Thank you. I have stopped the instance in east. I see that all other instances can gossip to that instance, and only one instance in west is having issues gossiping to that node. When I enable debug mode on the west node, I see the messages below from 16:32 to 16:47: DEBUG [RMI TCP Connection(272

Medusa : a new OSS backup/restore tool for Apache Cassandra

2019-11-06 Thread Alexander Dejanovski
Hi folks, I'm happy to announce that Spotify and TLP have been collaborating to create and open source a new backup and restore tool for Apache Cassandra: https://github.com/spotify/cassandra-medusa It is released under the Apache 2.0 license. It can perform full and differential backups, in pla

Cassandra 3.0.18 went OOM several hours after joining a cluster

2019-11-06 Thread Steinmaurer, Thomas
Hello, after moving from 2.1.18 to 3.0.18, we are facing OOM situations several hours after a node has successfully joined a cluster (via auto-bootstrap). I have created the following ticket trying to describe the situation, including hprof / MAT screens: https://issues.apache.org/jira/browse/C