> Has anyone experienced this sort of problem? It would be great to hear from
> anyone who has had experience with this sort of issue and/or suggestions for
> how to deal with it.
>
> Thanks, Eric
Yes, I did. The symptoms you describe point to a concurrent GC failure (a "concurrent mode failure" of the CMS collector). When this happens, the JVM completely stops the Java program (i.e. Cassandra) and performs a full stop-the-world GC cycle. While the node is paused, the other Cassandra nodes notice that it is not responding and consider it dead.

If the concurrent collector is properly tuned, it should never fall back to a full stop-the-world collection (that's why it is called concurrent ;-) ).

There can be several reasons for concurrent GC failures:
1. Not enough Java heap: try raising the maximum heap size (-Xmx).
2. Improperly sized Java heap regions (generations).

To help you narrow down the problem, pass the -XX:+PrintGCDetails option to the JVM that launches the Cassandra node. This logs information about internal GC activity. Let the node run until it is thrown out of the cluster again, then search the log for the strings "concurrent mode failure" or "promotion failed".
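If the GC log is large, a minimal Python sketch like the one below can do that search for you. It assumes the -XX:+PrintGCDetails output has been captured to a plain-text file (for example with HotSpot's -Xloggc:<file> option, or by redirecting the node's stdout); the script name find_gc_failures.py and the function find_gc_failures are hypothetical. It only reports lines containing one of the two strings mentioned above and does not parse the log format itself.

```python
import sys
from typing import Iterable, List, Tuple

# Strings the CMS collector writes to the GC log when it has to fall back
# to a full stop-the-world collection.
FAILURE_MARKERS = ("concurrent mode failure", "promotion failed")


def find_gc_failures(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for every line containing a failure marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in FAILURE_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


def main(path: str) -> None:
    # errors="replace" keeps the scan going if the log contains stray bytes.
    with open(path, encoding="utf-8", errors="replace") as log:
        for number, line in find_gc_failures(log):
            print(f"{path}:{number}: {line}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: find_gc_failures.py GC_LOG_FILE")
    main(sys.argv[1])
```

Usage, with a hypothetical log path: `python find_gc_failures.py /path/to/gc.log`. A hit close to the time the node was marked dead is a strong hint that a stop-the-world collection caused it.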