Hi, I have a similar issue with a stuck repair. It is a similar multi-region setup, only between us-east and a private cloud at Rackspace. The log mentions merkle tree exchanges and I see a lot of dropped communication:
I will comment on your ticket in Jira.

regards,

ondrej cernos

On Fri, Apr 19, 2013 at 4:50 AM, Arya Goudarzi <gouda...@gmail.com> wrote:

> We don't use default ports. Woops! Now I advertised mine. I did try
> disabling internode compression for all in cassandra.yaml but still it did
> not work. I have to open the insecure storage port to public ips.
>
>
> On Tue, Apr 16, 2013 at 4:59 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> So cassandra does inter node compression. I have not checked but this
>> might be accidentally getting turned on by default. Because the storage
>> port is typically 7000. Not sure why you are allowing 7100. In any case try
>> allowing 7000 or with internode compression off.
>>
>>
>> On Tue, Apr 16, 2013 at 6:42 PM, Arya Goudarzi <gouda...@gmail.com> wrote:
>>
>>> TL;DR; An EC2 Multi-Region Setup's Repair/Gossip Works with 1.1.10 but
>>> with 1.2.4, gossip does not see the nodes after restarting all nodes at
>>> once, and repair gets stuck.
>>>
>>> This is a working configuration:
>>> Cassandra 1.1.10 Cluster with 12 nodes in us-east-1 and 12 nodes in
>>> us-west-2
>>> Using Ec2MultiRegionSnitch and SSL enabled for DC_ONLY and
>>> NetworkTopologyStrategy with strategy_options: us-east-1:3;us-west-2:3;
>>> C* instances have a security group called 'cluster1'
>>> security group 'cluster1' in each region is configured as such
>>> Allow TCP:
>>> 7199 from cluster1 (JMX)
>>> 1024 - 65535 from cluster1 (JMX Random Ports - This supersedes all
>>> specific ports, but I have the specific ports just for clarity )
>>> 7100 from cluster1 (Configured Normal Storage)
>>> 7103 from cluster1 (Configured SSL Storage)
>>> 9160 from cluster1 (Configured Thrift RPC Port)
>>> 9160 from <client_group>
>>> foreach node's public IP we also have this rule set to enable cross
>>> region communication:
>>> 7103 from public_ip (Open SSL storage)
>>>
>>> The above is a functioning and happy setup. You run repair, and it
>>> finishes successfully.
>>>
>>> Broken Setup:
>>>
>>> Upgrade to 1.2.4 without changing any of the above security group
>>> settings:
>>>
>>> Run repair. The repair will get stuck. Thus hanging.
>>>
>>> Now for each public_ip add a security group rule as such to cluster1
>>> security group:
>>>
>>> Allow TCP: 7100 from public_ip
>>>
>>> Run repair. Things will work now. Also after restarting all nodes at the
>>> same time, gossip will see everyone again.
>>>
>>> I was told on https://issues.apache.org/jira/browse/CASSANDRA-5432 that
>>> nothing in terms of networking was changed. If nothing in terms of port and
>>> networking was changed in 1.2, then why is the above happening? I can
>>> constantly reproduce it.
>>>
>>> Please advise.
>>>
>>> -Arya
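
For anyone trying to reproduce this, a quick way to see which of the two storage ports is reachable across regions is a plain TCP connect test run from a node in the other region. The following is only a minimal sketch: the port numbers 7100 and 7103 are the ones Arya listed above, while the helper name `check_ports` and the timeout are my own assumptions. It only tests whether a TCP connection can be opened; it says nothing about what Cassandra sends over that port.

```python
import socket

# Ports taken from the thread: 7100 (configured storage port) and
# 7103 (configured SSL storage port). Adjust to your own cassandra.yaml.
STORAGE_PORTS = (7100, 7103)


def check_ports(host, ports=STORAGE_PORTS, timeout=3.0):
    """Hypothetical helper: return {port: bool}, True if a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

Calling `check_ports("<public_ip of a node in the other region>")` from each side should show 7103 reachable in both setups. If 7100 shows as unreachable on the 1.2.4 cluster where repair hangs, that would be consistent with what Arya describes.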