Hi,

I have a similar issue with a stuck repair. It is a similar multi-region
setup, only between us-east and a private cloud at Rackspace. The log
mentions Merkle tree exchanges and I see a lot of dropped communication:

I will comment on your ticket in Jira.

regards,

ondrej cernos


On Fri, Apr 19, 2013 at 4:50 AM, Arya Goudarzi <gouda...@gmail.com> wrote:

> We don't use default ports. Whoops! Now I advertised mine. I did try
> disabling internode compression for all in cassandra.yaml, but it still
> did not work. I have to open the insecure storage port to public IPs.
>
>
> On Tue, Apr 16, 2013 at 4:59 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> So Cassandra does internode compression. I have not checked, but this
>> might be accidentally getting turned on by default. The storage port is
>> typically 7000, so I am not sure why you are allowing 7100. In any case,
>> try allowing 7000 or turning internode compression off.
>>
>>
>> On Tue, Apr 16, 2013 at 6:42 PM, Arya Goudarzi <gouda...@gmail.com> wrote:
>>
>>> TL;DR; An EC2 Multi-Region Setup's Repair/Gossip Works with 1.1.10 but
>>> with 1.2.4, gossip does not see the nodes after restarting all nodes at
>>> once, and repair gets stuck.
>>>
>>> This is a working configuration:
>>> Cassandra 1.1.10 Cluster with 12 nodes in us-east-1 and 12 nodes in
>>> us-west-2
>>> Using Ec2MultiRegionSnitch and SSL enabled for DC_ONLY and
>>> NetworkTopologyStrategy with strategy_options: us-east-1:3;us-west-2:3;
>>> C* instances have a security group called 'cluster1'
>>> security group 'cluster1' in each region is configured as such
>>> Allow TCP:
>>> 7199 from cluster1 (JMX)
>>> 1024 - 65535 from cluster1 (JMX Random Ports - This supersedes all
>>> specific ports, but I have the specific ports just for clarity )
>>> 7100 from cluster1 (Configured Normal Storage)
>>> 7103 from cluster1 (Configured SSL Storage)
>>> 9160 from cluster1 (Configured Thrift RPC Port)
>>> 9160 from <client_group>
>>> For each node's public IP we also have this rule set to enable
>>> cross-region communication:
>>> 7103 from public_ip (Open SSL storage)
>>>
>>> The above is a functioning and happy setup. You run repair, and it
>>> finishes successfully.
>>>
>>> Broken Setup:
>>>
>>> Upgrade to 1.2.4 without changing any of the above security group
>>> settings:
>>>
>>> Run repair. The repair will get stuck and hang.
>>>
>>> Now for each public_ip add a security group rule as such to cluster1
>>> security group:
>>>
>>> Allow TCP: 7100 from public_ip
>>>
>>> Run repair. Things will work now. Also after restarting all nodes at the
>>> same time, gossip will see everyone again.
>>>
>>> I was told on https://issues.apache.org/jira/browse/CASSANDRA-5432 that
>>> nothing in terms of networking was changed. If nothing in terms of ports
>>> and networking was changed in 1.2, then why is the above happening? I can
>>> consistently reproduce it.
>>>
>>> Please advise.
>>>
>>> -Arya
>>>
>>>
>>
>
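
For anyone chasing the same symptom, one quick way to confirm which storage
ports are reachable across regions is a plain TCP connect test, run from a
node in the other region against a peer's public IP. The sketch below is only
an illustration under assumptions: ports 7100 and 7103 are the ones configured
in the thread above, while the peer address and the check_ports helper are
hypothetical placeholders, not anything Cassandra ships.

```python
import socket


def check_ports(host, ports, timeout=3.0):
    """Return {port: bool}: True if a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout or no route
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


if __name__ == "__main__":
    # Placeholder address (documentation range); substitute a peer's public IP.
    peer = "203.0.113.10"
    # 7100 = plain storage port, 7103 = SSL storage port, as set in the thread.
    for port, ok in check_ports(peer, [7100, 7103]).items():
        print(f"{peer}:{port} {'reachable' if ok else 'blocked or closed'}")
```

A successful connect only shows that the security group lets the port
through; it says nothing about whether gossip or repair streams are healthy
on that port.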
