> Now, pretty much every single scenario points towards a connectivity
> problem; however, we also have a few PostgreSQL replication streams
In the before time someone had problems with a switch/router that was dropping persistent but idle connections. I doubt this applies, and it would probably result in an error, but just throwing it out there.
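If something in the path (the ASA, for instance) is silently timing out idle connections, one cheap thing to try is making the kernel probe idle sockets well inside the firewall's idle timeout. This is only a sketch; the numbers are illustrative, assume Linux nodes, and only help sockets that were opened with SO_KEEPALIVE:

# check the current keepalive settings (Linux defaults are 7200 / 75 / 9)
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

# first probe after 60s idle, then every 10s, give up after 6 failed probes
sudo sysctl -w net.ipv4.tcp_keepalive_time=60
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=10
sudo sysctl -w net.ipv4.tcp_keepalive_probes=6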
Have you combed through the logs looking for errors or warnings?

I would repair a single small CF with -pr and watch closely.

Consider setting DEBUG logging (you can do it via JMX) for:

org.apache.cassandra.service.AntiEntropyService <- the class that manages repair
org.apache.cassandra.streaming <- the package that handles streaming

(rough commands for both suggestions are at the bottom of this mail, below the quote)

There was a fix to repair in 1.0.11, but that has to do with streaming:
https://github.com/apache/cassandra/blob/cassandra-1.0/CHANGES.txt#L5

Good luck.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 13/07/2012, at 10:16 PM, Bart Swedrowski wrote:

> Hello everyone,
>
> I'm facing quite a weird problem with Cassandra since we added a
> second DC to our cluster, and I have totally run out of ideas; this
> email is a call for help/advice!
>
> The history:
> - we used to have 4 nodes in a single DC
> - running Cassandra 0.8.7
> - RF:3
> - around 50GB of data on each node
> - RandomPartitioner and SimpleSnitch
>
> All was working fine for over 9 months. A few weeks ago we decided we
> wanted to add another 4 nodes in a second DC and join them to the
> cluster. Prior to doing that, we upgraded Cassandra to 1.0.9 to push
> it out of the door before the multi-DC work. After the upgrade, we
> left it running for over a week and it was all good; no issues.
>
> Then we added the 4 additional nodes in the other DC, bringing the
> cluster to 8 nodes in total, spread across two DCs, so now we have:
> - 8 nodes across 2 DCs, 4 in each DC
> - a 100Mbps low-latency connection (sub 5ms) running over a Cisco ASA
>   site-to-site VPN (which is IKEv1 based)
> - RF of DC1:3,DC2:3
> - RandomPartitioner, now using PropertyFileSnitch
>
> nodetool ring looks as follows:
> $ nodetool -h localhost ring
> Address         DC   Rack  Status  State   Load      Owns    Token
>                                                               148873535527910577765226390751398592512
> 192.168.81.2    DC1  RC1   Up      Normal  37.9 GB   12.50%  0
> 192.168.81.3    DC1  RC1   Up      Normal  35.32 GB  12.50%  21267647932558653966460912964485513216
> 192.168.81.4    DC1  RC1   Up      Normal  39.51 GB  12.50%  42535295865117307932921825928971026432
> 192.168.81.5    DC1  RC1   Up      Normal  19.42 GB  12.50%  63802943797675961899382738893456539648
> 192.168.94.178  DC2  RC1   Up      Normal  40.72 GB  12.50%  85070591730234615865843651857942052864
> 192.168.94.179  DC2  RC1   Up      Normal  30.42 GB  12.50%  106338239662793269832304564822427566080
> 192.168.94.180  DC2  RC1   Up      Normal  30.94 GB  12.50%  127605887595351923798765477786913079296
> 192.168.94.181  DC2  RC1   Up      Normal  12.75 GB  12.50%  148873535527910577765226390751398592512
>
> (please ignore the fact that the nodes are not interleaved; they
> should be, but there was a hiccup during the implementation phase.
> Unless *this* is the problem!)
>
> Now, the problem: over 7 out of 10 manual repairs do not finish. They
> usually get stuck and show 3 different symptoms:
>
> 1) Say node 192.168.81.2 runs a manual repair. It requests merkle
> trees from 192.168.81.2, 192.168.81.3, 192.168.81.5, 192.168.94.178,
> 192.168.94.179 and 192.168.94.181. It receives them from 192.168.81.2,
> 192.168.81.3, 192.168.81.5, 192.168.94.178 and 192.168.94.179, but not
> from 192.168.94.181. The logs on 192.168.94.181 say that it has sent
> the merkle tree back, but it is never received by 192.168.81.2.
> 2) Say node 192.168.81.2 runs a manual repair. It requests merkle
> trees from 192.168.81.2, 192.168.81.3, 192.168.81.5, 192.168.94.178,
> 192.168.94.179 and 192.168.94.181.
> It receives them from 192.168.81.2,
> 192.168.81.3, 192.168.81.5, 192.168.94.178 and 192.168.94.179, but not
> from 192.168.94.181. The logs on 192.168.94.181 do not say *anything*
> about a merkle tree being sent, and compactionstats does not show them
> being validated (generated) either.
> 3) The merkle trees are delivered, and the nodes start sending data
> across to sync themselves. On certain occasions they get "stuck"
> streaming files between each other at 100% and won't move forward.
> Now the interesting bit is that the ones that get stuck are always
> placed in different DCs!
>
> Now, pretty much every single scenario points towards a connectivity
> problem; however, we also have a few PostgreSQL replication streams
> running over this connection, some other traffic, and quite a lot of
> monitoring, and none of those are affected in any way.
>
> Also, if random packets were being lost, I'd expect TCP to correct
> that (re-transmit them).
>
> It doesn't matter whether it's a full manual repair or just a -pr
> repair; both end in pretty much the same way.
>
> Has anyone come across this kind of issue before, or any ideas how
> else I could investigate this? The issue is pressing me massively as
> this is our live cluster and I have to run repairs by hand (usually
> multiple times before one finally goes through) every single day… And
> I'm also not sure whether the cluster is being affected in any other
> way.
>
> I've gone through the Jira issues and considered upgrading to 1.1.x,
> but I can't see anything that even looks like what is happening to my
> cluster.
>
> If any further information, like logs or configuration files, is
> needed, please let me know.
>
> Any information, suggestions or advice - greatly appreciated.
>
> Kind regards,
> Bart
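Roughly what I had in mind for the DEBUG logging and the single-CF repair. Treat it as a sketch only: the MBean name and JMX port are the 1.0 defaults, jmxterm.jar stands in for whatever jmxterm build you have, and YourKeyspace/SmallCF are placeholders for whatever you pick:

# flip the two loggers over JMX with jmxterm
$ java -jar jmxterm.jar -l localhost:7199
$> run -b org.apache.cassandra.db:type=StorageService setLog4jLevel org.apache.cassandra.service.AntiEntropyService DEBUG
$> run -b org.apache.cassandra.db:type=StorageService setLog4jLevel org.apache.cassandra.streaming DEBUG

# or add the same levels to conf/log4j-server.properties (picked up after a short
# delay if your build watches the file, otherwise on the next restart)
log4j.logger.org.apache.cassandra.service.AntiEntropyService=DEBUG
log4j.logger.org.apache.cassandra.streaming=DEBUG

# then repair a single small CF on its primary range and watch the logs on both ends
$ nodetool -h localhost repair -pr YourKeyspace SmallCF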