Hi Alain,

Thanks for your reply.
Unfortunately, it is a rather old version of the system, which ships with Cassandra v1.2.15, and a database upgrade does not seem to be a viable option for us.

We have also recently observed a situation where a Cassandra instance froze for around one minute while the other nodes eventually marked that node DOWN. Here are some logs from the scenario; there is a one-minute window with no sign of any operation running:

Gossip related:

TRACE [GossipStage:1] 2016-04-13 23:34:08,641 GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage from /156.1.1.1
TRACE [GossipStage:1] 2016-04-13 23:35:01,081 GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are : /156.1.1.1:1460103192:520418 /156.1.1.4:1460103190:522108 /156.1.1.2:1460103205:522912 /156.1.1.3:1460551526:41979

GC related:

2016-04-13T23:34:02.675+0000: 487270.189: Total time for which application threads were stopped: 0.0677060 seconds
2016-04-13T23:35:01.019+0000: 487328.533: [GC2016-04-13T23:35:01.020+0000: 487328.534: [ParNew
Desired survivor size 1474560 bytes, new threshold 1 (max 1)
- age 1: 1637144 bytes, 1637144 total
: 843200K->1600K(843200K), 0.0559840 secs] 5631683K->4814397K(8446400K), 0.0567850 secs] [Times: user=0.67 sys=0.00, real=0.05 secs]

Regular Cassandra operation:

INFO [CompactionExecutor:70229] 2016-04-13 23:34:02,439 CompactionTask.java (line 266) Compacted 4 sstables to [/opt/ruckuswireless/wsg/db/data/wsg/indexHistoricalRuckusClient/wsg-indexHistoricalRuckusClient-ic-1464,]. 54,743,298 bytes to 53,661,608 (~98% of original) in 29,124ms = 1.757166MB/s. 417,517 total rows, 265,853 unique. Row merge counts were {1:114862, 2:150328, 3:653, 4:10, }
INFO [HANDSHAKE-/156.1.1.2] 2016-04-13 23:35:01,110 OutboundTcpConnection.java (line 418) Handshaking version with /156.1.1.2

The situation occurs randomly across all nodes. When it happens, the hector client application also seems to have trouble connecting to that Cassandra node, for example:

04-13 23:34:54 [taskExecutor-167] ConcurrentHClientPool:273 ERROR - Transport exception in re-opening client in release on <ConcurrentCassandraClientPoolByHost>:{localhost(127.0.0.1):9160}

Has anyone had a similar experience? The operating system is Ubuntu and the kernel version is 2.6.32.24.

Thanks in advance!

Sincerely,
Michael Fong

From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
Sent: Wednesday, April 13, 2016 9:30 PM
To: user@cassandra.apache.org
Subject: Re: C* 1.2.x vs Gossip marking DOWN/UP

Hi Michael,

I had critical issues around gossip using 1.2 (.11, I believe), but that was about two years ago.

Are you using the latest C* 1.2.19 minor version? If not, you should probably go there ASAP. A lot of issues like this one https://issues.apache.org/jira/browse/CASSANDRA-6297 have been fixed since then on C* 1.2, 2.0, 2.1, 2.2, 3.0.X, 3.X. You will have to go through the upgrade in steps, but going to the latest 1.2 minor should be safe and enough to solve this issue.

For your information, even C* 2.0 is no longer supported; the minimum version you should use now is the latest 2.1 minor. This technical debt might end up costing you more in terms of time, money and quality of service than taking care of upgrades would. Most probably, your bug is already fixed in newer versions. Plus, it is not very interesting for us to help, as we would have to go through old code to find issues that are most likely already fixed. If you want support (from the community or a commercial provider), you really should upgrade this cluster. Make sure your clients are compatible too.
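On the client side, and completely untested against your setup, the kind of thing I would look at in hector is whether the pool is allowed to retry downed hosts and how long a stuck Thrift call is allowed to hang. A minimal sketch of that idea follows; the host list, cluster name and timeout values are only placeholders, and the method names are from memory, so check them against the hector version you actually ship:

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorPoolSetup {

    public static Cluster buildCluster() {
        // Placeholder host list: list every node, not only localhost, so the
        // pool can fail over while one node is frozen.
        CassandraHostConfigurator conf = new CassandraHostConfigurator(
                "156.1.1.1:9160,156.1.1.2:9160,156.1.1.3:9160,156.1.1.4:9160");

        // Keep retrying hosts that were marked down instead of dropping them for good.
        conf.setRetryDownedHosts(true);
        conf.setRetryDownedHostsDelayInSeconds(10);

        // Fail a stuck Thrift call after a few seconds rather than hanging for
        // the whole one-minute freeze (milliseconds, placeholder value).
        conf.setCassandraThriftSocketTimeout(5000);

        // "wsg-cluster" is a placeholder cluster name.
        return HFactory.getOrCreateCluster("wsg-cluster", conf);
    }
}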
I did not know that some people were still using C* < 2.0 :-).

Cheers,

-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-13 10:58 GMT+02:00 Michael Fong <michael.f...@ruckuswireless.com>:

Hi, all

We have been running a 4-node Cassandra cluster (C* 1.2.x) where one node marked all the other 3 nodes DOWN and then marked them back UP a few seconds later. A compaction, roughly 10 MB in size, had kicked in about a minute before the other nodes were marked DOWN. In other words, in system.log we see:

00:00:00 Compacting …
00:00:03 Compacted 8 sstables … ~10 megabytes
00:01:06 InetAddress /x.x.x.4 is now DOWN
00:01:06 InetAddress /x.x.x.3 is now DOWN
00:01:06 InetAddress /x.x.x.1 is now DOWN

There was no significant GC activity in gc.log. We have heard that heavy compaction activity can cause this behavior, but we cannot see why that would happen logically. How could a compaction operation stop the Gossip thread from performing its heartbeat check? Has anyone experienced this kind of behavior before?

Thanks in advance!

Sincerely,
Michael Fong
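P.S. To double-check whether the whole JVM (or the box itself, e.g. swapping on that old 2.6.32 kernel) stalls during that one-minute window even though gc.log looks clean, we are thinking of running a small stand-alone watchdog next to Cassandra that simply sleeps and measures how late it wakes up. A rough, untested sketch of the idea; the class name and thresholds below are arbitrary:

// Sleep for a short, fixed interval and report whenever the wall clock
// advanced far more than requested; long gaps point to process- or
// machine-wide pauses, independently of what gc.log records.
public class StallWatchdog {

    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 100;     // expected sleep time per iteration
        final long reportAboveMs = 1000; // only report stalls longer than this
        long before = System.nanoTime();
        while (true) {
            Thread.sleep(intervalMs);
            long after = System.nanoTime();
            long stallMs = (after - before) / 1000000L - intervalMs;
            if (stallMs > reportAboveMs) {
                System.out.println(new java.util.Date() + " detected a stall of ~" + stallMs + " ms");
            }
            before = after;
        }
    }
}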