Hi Michael,

I had critical issues around gossip on 1.2 (1.2.11, I believe), but that was about 2 years ago.
Are you using the latest 1.2 minor version, C* 1.2.19? If not, you should probably go there asap. A lot of issues like this one, https://issues.apache.org/jira/browse/CASSANDRA-6297, have been fixed since then in C* 1.2, 2.0, 2.1, 2.2, 3.0.X and 3.X. You have to go through a few steps to upgrade. Going to the latest 1.2 minor should be safe, and it should be enough to solve this issue.

For your information, even C* 2.0 is no longer supported. The minimum version you should use now is the latest 2.1 release. This technical debt might end up costing you more in time, money and quality of service than taking care of upgrades would.

Your bug is most likely already fixed in newer versions. It is also not very interesting for us to help, as we would have to go through old code to find issues that are most likely already fixed. If you want support (from the community or a commercial one), you really should upgrade this cluster. Make sure your clients are compatible too.

I did not know that some people were still using C* < 2.0 :-).

Cheers,

-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-13 10:58 GMT+02:00 Michael Fong <michael.f...@ruckuswireless.com>:

> Hi, all
>
> We have a 4-node Cassandra cluster (C* 1.2.x) in which one node marked
> the other 3 nodes DOWN; they came back UP a few seconds later. A
> compaction of roughly 10 MB had kicked in about a minute before, and the
> other nodes were marked DOWN after it. In other words, in system.log we
> see:
>
> 00:00:00 Compacting ….
> 00:00:03 Compacted 8 sstables … 10~ megabytes
> 00:01:06 InetAddress /x.x.x.4 is now DOWN
> 00:01:06 InetAddress /x.x.x.3 is now DOWN
> 00:01:06 InetAddress /x.x.x.1 is now DOWN
>
> There was no significant GC activity in gc.log. We have heard that heavy
> compaction activity can cause this behavior, but we cannot work out
> logically why it would happen. Why would a compaction operation stop the
> Gossip thread from performing its heartbeat check? Has anyone experienced
> this kind of behavior before?
>
> Thanks in advance!
>
> Sincerely,
>
> Michael Fong
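
To check how closely the DOWN events follow a compaction across a longer stretch of system.log, a small script along these lines might help. It is only a sketch: it assumes the simplified `HH:MM:SS message` layout of the excerpt quoted above, whereas real system.log lines also carry a log level, thread name and full timestamp, so the regex would need adapting. The function name `down_after_compaction` is made up for illustration, and the code does not handle timestamps that wrap past midnight.

```python
import re
from datetime import datetime

# Assumed, simplified line layout: "HH:MM:SS message", as in the quoted excerpt.
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+(.*)$")
DOWN = re.compile(r"InetAddress (\S+) is now DOWN")


def down_after_compaction(lines):
    """Return (node, seconds since the last 'Compacted' line) for each DOWN event."""
    last_compacted = None
    gaps = []
    for raw in lines:
        m = LINE.match(raw.strip())
        if not m:
            continue  # skip lines that do not fit the assumed layout
        ts = datetime.strptime(m.group(1), "%H:%M:%S")
        msg = m.group(2)
        if msg.startswith("Compacted"):
            last_compacted = ts
            continue
        down = DOWN.search(msg)
        if down and last_compacted is not None:
            gap = int((ts - last_compacted).total_seconds())
            gaps.append((down.group(1), gap))
    return gaps


# Sample lines modelled on the excerpt; the elided parts stay elided.
sample = [
    "00:00:00 Compacting ...",
    "00:00:03 Compacted 8 sstables ... 10~ megabytes",
    "00:01:06 InetAddress /x.x.x.4 is now DOWN",
    "00:01:06 InetAddress /x.x.x.3 is now DOWN",
    "00:01:06 InetAddress /x.x.x.1 is now DOWN",
]

for node, gap in down_after_compaction(sample):
    print(f"{node} marked DOWN {gap}s after the last compaction finished")
```

On the sample lines the gap is 63 seconds for each node. If the gap varies widely over a longer log, the compaction is less likely to be the direct trigger.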