TL;DR: In an EC2 multi-region setup, repair and gossip work with 1.1.10. With
1.2.4, gossip does not see the nodes after all nodes are restarted at once,
and repair gets stuck.

This is a working configuration:
Cassandra 1.1.10 cluster with 12 nodes in us-east-1 and 12 nodes in
us-west-2
Using Ec2MultiRegionSnitch, with SSL enabled for DC_ONLY, and
NetworkTopologyStrategy with strategy_options: us-east-1:3;us-west-2:3;
C* instances have a security group called 'cluster1'
The 'cluster1' security group in each region is configured as follows:
Allow TCP:
7199 from cluster1 (JMX)
1024 - 65535 from cluster1 (JMX random ports; this range already covers all
the specific ports, which I list separately just for clarity)
7100 from cluster1 (Configured Normal Storage)
7103 from cluster1 (Configured SSL Storage)
9160 from cluster1 (Configured Thrift RPC Port)
9160 from <client_group>
For each node's public IP we also have this rule, to enable cross-region
communication:
7103 from public_ip (Open SSL storage)

The above is a functioning and happy setup. You run repair, and it finishes
successfully.

Broken Setup:

Upgrade to 1.2.4 without changing any of the above security group settings:

Run repair. The repair gets stuck and hangs.

Now, for each public_ip, add the following rule to the cluster1
security group:

Allow TCP: 7100 from public_ip
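
With 24 nodes, that means one rule per node per storage port. The list can be generated rather than typed by hand. This is only a minimal sketch: cross_region_rules is a hypothetical helper name, the IPs are documentation-range placeholders, and the /32 suffix assumes each rule targets a single public IP. It only prints the rules; it does not call any AWS API.

```python
# Hypothetical helper: builds the per-node ingress rules described above.
# 7100 is the configured normal storage port, 7103 the configured SSL storage port.
def cross_region_rules(public_ips, ports=(7100, 7103)):
    """Return (protocol, port, cidr) tuples, one per node per port."""
    # Assumption: each rule covers exactly one public IP, hence the /32 mask.
    return [("tcp", port, f"{ip}/32") for ip in public_ips for port in ports]


if __name__ == "__main__":
    # Placeholder addresses, not real cluster nodes.
    for proto, port, cidr in cross_region_rules(["203.0.113.10", "198.51.100.20"]):
        print(f"Allow {proto.upper()}: {port} from {cidr}")
```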

Run repair. It now works. Also, after restarting all nodes at the
same time, gossip sees every node again.
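
To check from one node whether a peer's storage ports are reachable over its public IP, a plain TCP connect probe is enough. The following is a minimal sketch and not part of Cassandra: port_open and check_peers are hypothetical helper names, and 7100/7103 are the storage ports from the configuration above. A successful connect only shows that the security group lets the connection through; it says nothing about gossip state itself.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False


def check_peers(peer_ips, ports=(7100, 7103), timeout=3.0):
    """Probe every peer on every port; return {(ip, port): reachable}."""
    return {
        (ip, port): port_open(ip, port, timeout)
        for ip in peer_ips
        for port in ports
    }

# Example use, with a placeholder address standing in for a peer's public IP:
#   for (ip, port), ok in check_peers(["203.0.113.10"]).items():
#       print(ip, port, "open" if ok else "blocked")
```

Running this from a node in one region against the public IPs of the other region, before and after adding the 7100 rule, would show whether the non-SSL storage port is the one being blocked.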

I was told on https://issues.apache.org/jira/browse/CASSANDRA-5432 that
nothing in terms of networking was changed. If nothing in terms of ports and
networking was changed in 1.2, then why is the above happening? I can
consistently reproduce it.

Please advise.

-Arya
