Hello everyone,

I'm facing quite a weird problem with Cassandra since we added a
secondary DC to our cluster, and I have totally run out of ideas; this
email is a call for help/advice!

History looks like:
- we used to have 4 nodes in a single DC
- running Cassandra 0.8.7
- RF:3
- around 50GB of data on each node
- RandomPartitioner and SimpleSnitch

All was working fine for over 9 months.  A few weeks ago we decided we
wanted to add another 4 nodes in a second DC and join them to the
cluster.  Prior to doing that, we upgraded Cassandra to 1.0.9, so the
upgrade was out of the way before the multi-DC work.  After the
upgrade, we left it running for over a week and it was all good; no
issues.

Then, we added 4 additional nodes in another DC, bringing the cluster
to 8 nodes in total spread across two DCs, so now we have:
- 8 nodes across 2 DCs, 4 in each DC
- 100Mbps low-latency connection (sub 5ms) running over a Cisco ASA
site-to-site VPN (which is IKEv1-based)
- RF of DC1:3, DC2:3
- RandomPartitioner, and now using PropertyFileSnitch (topology file
sketched just below)
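
The cassandra-topology.properties on each node looks roughly like this
(the comments and the default= fallback value are just illustrative;
the mappings mirror the ring output further down):

# cassandra-topology.properties (path depends on your install)
# <node IP>=<data centre>:<rack>
192.168.81.2=DC1:RC1
192.168.81.3=DC1:RC1
192.168.81.4=DC1:RC1
192.168.81.5=DC1:RC1
192.168.94.178=DC2:RC1
192.168.94.179=DC2:RC1
192.168.94.180=DC2:RC1
192.168.94.181=DC2:RC1
# fallback for any node not listed above (value here is only an example)
default=DC1:RC1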

nodetool ring looks as follows:
$ nodetool -h localhost ring
Address         DC   Rack  Status  State   Load      Owns    Token
                                                              148873535527910577765226390751398592512
192.168.81.2    DC1  RC1   Up      Normal  37.9 GB   12.50%  0
192.168.81.3    DC1  RC1   Up      Normal  35.32 GB  12.50%  21267647932558653966460912964485513216
192.168.81.4    DC1  RC1   Up      Normal  39.51 GB  12.50%  42535295865117307932921825928971026432
192.168.81.5    DC1  RC1   Up      Normal  19.42 GB  12.50%  63802943797675961899382738893456539648
192.168.94.178  DC2  RC1   Up      Normal  40.72 GB  12.50%  85070591730234615865843651857942052864
192.168.94.179  DC2  RC1   Up      Normal  30.42 GB  12.50%  106338239662793269832304564822427566080
192.168.94.180  DC2  RC1   Up      Normal  30.94 GB  12.50%  127605887595351923798765477786913079296
192.168.94.181  DC2  RC1   Up      Normal  12.75 GB  12.50%  148873535527910577765226390751398592512

(please ignore the fact that the nodes are not interleaved between the
DCs; they should be, but there was a hiccup during the implementation
phase.  Unless *this* is the problem!)

Now, the problem: more than 7 out of 10 manual repairs never finish.
They usually get stuck and show 3 different symptoms:

  1). Say node 192.168.81.2 runs a manual repair: it requests merkle
trees from 192.168.81.2, 192.168.81.3, 192.168.81.5, 192.168.94.178,
192.168.94.179 and 192.168.94.181.  It receives them from all of those
except 192.168.94.181.  192.168.94.181's logs say that it has sent its
merkle tree back, but the tree is never received by 192.168.81.2.
  2). Same setup: 192.168.81.2 runs a manual repair and receives merkle
trees from every node except 192.168.94.181.  This time, though,
192.168.94.181's logs say *nothing* about a merkle tree being sent, and
compactionstats doesn't show a validation compaction either, so the
tree doesn't even seem to get generated.
  3). The merkle trees are all delivered and the nodes start streaming
data to each other to get back in sync.  On certain occasions they get
"stuck" streaming files between each other at 100% and never move
forward.  Now the interesting bit: the pairs that get stuck are always
in different DCs!  (See the watch loop sketched just below.)
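
For what it's worth, this is roughly how I watch a repair from both
ends while it runs (just a sketch; "MyKeyspace" is a placeholder for
the real keyspace name, and the two IPs are taken from the ring above):

$ # kick off the repair on the coordinating node
$ nodetool -h 192.168.81.2 repair MyKeyspace &
$ # ...and in another terminal, poll both ends every 30 seconds
$ while true; do
>   date
>   nodetool -h 192.168.81.2   compactionstats   # validation compaction = merkle tree being built
>   nodetool -h 192.168.81.2   netstats          # streaming, as seen from the coordinator
>   nodetool -h 192.168.94.181 compactionstats   # same, on the node that goes quiet
>   nodetool -h 192.168.94.181 netstats
>   sleep 30
> done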

Now, pretty much every one of these scenarios points towards a
connectivity problem.  However, we also have a few PostgreSQL
replication streams running over this connection, plus some other
traffic and quite a lot of monitoring, and none of those are affected
in any way.

Also, if random packets are being lost, I'd expect TCP to correct that
(re-transmit them).
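
For reference, the kind of check that can be run on the nodes to test
that assumption is sketched below (7000 is Cassandra's default
storage_port; adjust if yours differs):

$ # TCP retransmission counters; run twice a few minutes apart and compare
$ netstat -s | grep -i retrans
$ # established connections to/from the other DC on the storage port
$ netstat -tn | grep ':7000'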

It doesn't matter whether it's a full manual repair or just a repair
with -pr; both end up pretty much the same way.
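
(For clarity, these are the two invocations I mean; 192.168.81.2 is
just an example node, and with no keyspace argument nodetool repairs
all keyspaces.)

$ # repair every range this node holds a replica for
$ nodetool -h 192.168.81.2 repair
$ # repair only the node's primary range
$ nodetool -h 192.168.81.2 repair -pr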

Has anyone come across this kind of issue before, or does anyone have
ideas how else I could investigate it?  The issue is pressing me
massively: this is our live cluster, and I have to run the repairs by
hand (usually several times before one finally goes through) every
single day…  I'm also not sure whether the cluster is being affected
in some other way that I haven't noticed yet.

I've gone through the Jira issues and considered upgrading to 1.1.X,
but I can't see anything there that even resembles what is happening
to my cluster.

If any further information (logs, configuration files, etc.) is
needed, please let me know.

Any information, suggestions or advice would be greatly appreciated.

Kind regards,
Bart
