Hi, Alain,

Thanks for your reply.

Unfortunately, this is a rather old version of our system, which ships with Cassandra 
v1.2.15, and a database upgrade does not seem to be a viable option. We have 
also recently observed a situation where the Cassandra instance froze for around one 
minute while the other nodes eventually marked that node DOWN. Here are some logs 
from the scenario; there is a one-minute window with no sign of any operation 
running:

Gossip-related:
TRACE [GossipStage:1] 2016-04-13 23:34:08,641 GossipDigestSynVerbHandler.java 
(line 40) Received a GossipDigestSynMessage from /156.1.1.1
TRACE [GossipStage:1] 2016-04-13 23:35:01,081 GossipDigestSynVerbHandler.java 
(line 71) Gossip syn digests are : /156.1.1.1:1460103192:520418 
/156.1.1.4:1460103190:522108 /156.1.1.2:1460103205:522912 
/156.1.1.3:1460551526:41979

GC-related:
2016-04-13T23:34:02.675+0000: 487270.189: Total time for which application 
threads were stopped: 0.0677060 seconds
2016-04-13T23:35:01.019+0000: 487328.533: [GC2016-04-13T23:35:01.020+0000: 
487328.534: [ParNew
Desired survivor size 1474560 bytes, new threshold 1 (max 1)
- age   1:    1637144 bytes,    1637144 total
: 843200K->1600K(843200K), 0.0559840 secs] 5631683K->4814397K(8446400K), 
0.0567850 secs] [Times: user=0.67 sys=0.00, real=0.05 secs]

Regular Cassandra operation:
INFO [CompactionExecutor:70229] 2016-04-13 23:34:02,439 CompactionTask.java 
(line 266) Compacted 4 sstables to 
[/opt/ruckuswireless/wsg/db/data/wsg/indexHistoricalRuckusClient/wsg-indexHistoricalRuckusClient-ic-1464,].
  54,743,298 bytes to 53,661,608 (~98% of original) in 29,124ms = 1.757166MB/s. 
 417,517 total rows, 265,853 unique.  Row merge counts were {1:114862, 
2:150328, 3:653, 4:10, }
INFO [HANDSHAKE-/156.1.1.2] 2016-04-13 23:35:01,110 OutboundTcpConnection.java 
(line 418) Handshaking version with /156.1.1.2
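
For reference, the one-minute hole between 23:34:08 and 23:35:01 above looks like a 
whole-JVM stall rather than one slow thread. Below is a rough pause-watchdog sketch 
(plain JDK, nothing Cassandra-specific, thresholds picked arbitrarily) that we could 
run alongside the workload to confirm such stalls: a daemon thread sleeps for a short 
interval and reports whenever the observed wall-clock gap is much larger, which catches 
GC, safepoint or swapping pauses that freeze every thread at once.

public class PauseWatchdog {
    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 100;         // how often we expect to wake up
        final long reportThresholdMs = 1000; // report stalls longer than this
        Thread watchdog = new Thread(new Runnable() {
            public void run() {
                long last = System.currentTimeMillis();
                while (true) {
                    try {
                        Thread.sleep(intervalMs);
                    } catch (InterruptedException e) {
                        return;
                    }
                    long now = System.currentTimeMillis();
                    long stall = now - last - intervalMs; // extra time beyond the sleep
                    if (stall > reportThresholdMs)
                        System.err.println("JVM stalled for ~" + stall + " ms");
                    last = now;
                }
            }
        });
        watchdog.setDaemon(true);
        watchdog.start();
        Thread.sleep(Long.MAX_VALUE); // stand-in for the real application workload
    }
}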

The situation occurs randomly across all nodes. When this happens, the Hector 
client application also seems to have trouble connecting to that Cassandra instance, 
for example:
04-13 23:34:54 [taskExecutor-167] ConcurrentHClientPool:273 ERROR - Transport 
exception in re-opening client in release on 
<ConcurrentCassandraClientPoolByHost>:{localhost(127.0.0.1):9160}
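
Incidentally, the error suggests the pool only knows about localhost(127.0.0.1):9160. 
For what it's worth, here is a rough sketch of how we might configure the Hector pool 
to list all four nodes and keep retrying downed hosts, so the client can fail over while 
one node is frozen. The class and method names are recalled from Hector 1.x and the 
cluster name "wsg-cluster" is just a placeholder, so this would need verification 
against our client code.

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorClientSketch {
    public static void main(String[] args) {
        // List every node instead of only localhost so the pool has somewhere to fail over to.
        CassandraHostConfigurator conf = new CassandraHostConfigurator(
                "156.1.1.1:9160,156.1.1.2:9160,156.1.1.3:9160,156.1.1.4:9160");
        conf.setRetryDownedHosts(true);              // keep probing hosts the pool has marked down
        conf.setRetryDownedHostsDelayInSeconds(10);  // how often to re-test them
        conf.setCassandraThriftSocketTimeout(10000); // fail fast instead of hanging on a frozen node

        Cluster cluster = HFactory.getOrCreateCluster("wsg-cluster", conf);
        Keyspace keyspace = HFactory.createKeyspace("wsg", cluster);
        // ... issue queries through the keyspace as usual.
    }
}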

Has anyone had a similar experience? The operating system is Ubuntu with kernel 
version 2.6.32.24. Thanks in advance!

Sincerely,

Michael fong

From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
Sent: Wednesday, April 13, 2016 9:30 PM
To: user@cassandra.apache.org
Subject: Re: C* 1.2.x vs Gossip marking DOWN/UP

Hi Michael,

I had critical issues using 1.2 (.11, I believe) around gossip (but it was like 
2 years ago...).

Are you using the latest C* 1.2 minor version, 1.2.19? If not, you should probably 
move there ASAP.

A lot of issues like this one 
(https://issues.apache.org/jira/browse/CASSANDRA-6297) have been fixed since then 
on C* 1.2, 2.0, 2.1, 2.2, 3.0.x and 3.x. You will have to go through the upgrade 
steps, but it should be safe, and moving to the latest 1.2 minor should be enough to 
solve this issue.

For your information, even C* 2.0 is no longer supported. The minimum version 
you should be running now is the latest 2.1.x.

This technical debt might end up costing you more in terms of time, money and 
quality of service than keeping up with upgrades would. The most probable outcome is 
that your bug is already fixed in newer versions. It is also not very 
rewarding for us to help, as we would have to dig through old code to find 
issues that are most likely already fixed. If you want support (from the 
community or a commercial provider), you really should upgrade this cluster. Make sure 
your clients are compatible too.

I did not know that some people were still using C* < 2.0 :-).

Cheers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-13 10:58 GMT+02:00 Michael Fong <michael.f...@ruckuswireless.com>:
Hi, all


We have a 4-node Cassandra cluster (C* 1.2.x) where a node marked all the 
other 3 nodes DOWN, and then saw them come back UP a few seconds later. A 
compaction, roughly 10 MB in size, had kicked in about a minute before the other 
nodes were marked DOWN. In other words, in the system.log we see:
00:00:00 Compacting ….
00:00:03 Compacted 8 sstables … 10~ megabytes
00:01:06 InetAddress /x.x.x.4 is now DOWN
00:01:06 InetAddress /x.x.x.3 is now DOWN
00:01:06 InetAddress /x.x.x.1 is now DOWN

There was no significant GC activity in gc.log. We have heard that busy 
compaction activity can cause this behavior, but we cannot see why that 
could happen logically. How could a compaction operation stop the Gossip 
thread from performing its heartbeat check? Has anyone experienced this kind of 
behavior before?
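
For context on the DOWN/UP flapping itself, here is a simplified, self-contained 
illustration of the phi-accrual idea behind Cassandra's failure detector (not the 
actual implementation, and the numbers are only indicative): phi grows the longer a 
peer stays silent, and the peer is convicted once phi exceeds the threshold 
(Cassandra's default phi_convict_threshold is 8).

import java.util.ArrayDeque;
import java.util.Deque;

// Simplified phi-accrual sketch: phi = -log10(P(no heartbeat yet)), assuming
// exponentially distributed inter-arrival times with the observed mean.
public class PhiAccrualSketch {
    private static final int WINDOW = 1000;      // heartbeat intervals to remember
    private static final double THRESHOLD = 8.0; // Cassandra's default phi_convict_threshold

    private final Deque<Long> intervalsMs = new ArrayDeque<Long>();
    private long lastHeartbeatMs = -1;

    public void heartbeat(long nowMs) {
        if (lastHeartbeatMs > 0) {
            intervalsMs.addLast(nowMs - lastHeartbeatMs);
            if (intervalsMs.size() > WINDOW)
                intervalsMs.removeFirst();
        }
        lastHeartbeatMs = nowMs;
    }

    public double phi(long nowMs) {
        if (intervalsMs.isEmpty())
            return 0.0;
        double sumMs = 0;
        for (long interval : intervalsMs)
            sumMs += interval;
        double meanMs = sumMs / intervalsMs.size();
        double silenceMs = nowMs - lastHeartbeatMs;
        // P(no heartbeat after silenceMs) = e^(-silence/mean); phi = -log10 of that.
        return silenceMs / (meanMs * Math.log(10));
    }

    public boolean convicted(long nowMs) {
        return phi(nowMs) > THRESHOLD;
    }
}

In this model, with heartbeats arriving roughly every second, phi crosses 8 after 
about 18-20 seconds of silence, so a node whose gossip stage stalls for a minute would 
convict all of its peers as soon as it wakes up, then mark them UP again once 
heartbeats flow.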

Thanks in advance!

Sincerely,

Michael Fong
