It seems that with NTP properly configured, replication is now working as expected, but we are still seeing a lot of read timeouts. The troubleshooting continues...
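One client-side knob worth checking while the server side is investigated is the driver's read timeout and read consistency level. A minimal sketch against the 2.0 Java Driver; the contact-point addresses and the 20-second timeout here are illustrative assumptions, not a known-good configuration:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.QueryOptions;
    import com.datastax.driver.core.SocketOptions;

    public class ClusterFactory {
        public static Cluster build() {
            return Cluster.builder()
                    // Hypothetical private IPs of the three nodes in the VPC subnet.
                    .addContactPoints("10.0.0.10", "10.0.0.11", "10.0.0.12")
                    // The 2.0 driver's default per-request read timeout is 12s;
                    // raising it helps separate genuinely slow reads from
                    // client-side impatience. 20s is an arbitrary example value.
                    .withSocketOptions(new SocketOptions().setReadTimeoutMillis(20000))
                    // With RF=3, QUORUM reads need only 2 of 3 replicas to
                    // respond, which can mask a single slow or lagging node.
                    .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.QUORUM))
                    .build();
        }
    }

This won't fix a cluster-side problem, but it makes the timeouts easier to interpret: if QUORUM reads with a generous timeout still fail, the issue is almost certainly between the nodes rather than between client and coordinator.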
On Tue, Nov 19, 2013 at 8:53 AM, Steven A Robenalt <srobe...@stanford.edu> wrote:

> Thanks Michael, I will try that out.
>
> On Tue, Nov 19, 2013 at 5:28 AM, Laing, Michael <michael.la...@nytimes.com> wrote:
>
>> We had a similar problem when our nodes could not sync using ntp due to
>> VPC ACL settings. -ml
>>
>> On Mon, Nov 18, 2013 at 8:49 PM, Steven A Robenalt <srobe...@stanford.edu> wrote:
>>
>>> Hi all,
>>>
>>> I am attempting to bring up our new app on a 3-node cluster and am
>>> having problems with frequent read timeouts and slow inter-node
>>> replication. Initially, these errors were mostly occurring in our app
>>> server, affecting 0.02%-1.0% of our queries in an otherwise unloaded
>>> cluster. No exceptions were logged on the servers in this case, and
>>> reads in a single-node environment with the same code and client driver
>>> virtually never see exceptions like this, so I suspect problems with
>>> inter-node communication.
>>>
>>> The 3 nodes are deployed in a single AWS VPC, and are all in a common
>>> subnet. The Cassandra version is 2.0.2, following an upgrade this past
>>> weekend due to NPEs in a secondary index that were affecting certain
>>> queries under 2.0.1. The servers are m1.large instances running AWS
>>> Linux and Oracle JDK7u40. The first 2 nodes in the cluster are the seed
>>> nodes. All database contents are CQL tables with a replication factor
>>> of 3, and the application is Java-based, using the latest Datastax
>>> 2.0.0-rc1 Java Driver.
>>>
>>> In testing with the application, I noticed this afternoon that the
>>> contents of the 3 nodes differed in their respective copies of the same
>>> table for newly written data, for time periods exceeding several
>>> minutes, as reported by cqlsh on each node. Specifying different hosts
>>> from the same server using cqlsh also exhibited timeouts on multiple
>>> attempts to connect, and on executing some queries, though they
>>> eventually succeeded in all cases, and the data on all nodes was
>>> eventually fully replicated.
>>>
>>> The AWS servers have a security group with only ports 22, 7000, 9042,
>>> and 9160 open.
>>>
>>> At this time, it seems that either I am still missing something in my
>>> cluster configuration, or maybe there are other ports that are needed
>>> for inter-node communication.
>>>
>>> Any advice/suggestions would be appreciated.

--
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063
srobe...@stanford.edu
http://highwire.stanford.edu
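On the ports question raised above: by default Cassandra uses 7000 for inter-node (storage) traffic, 7001 for SSL inter-node traffic, 7199 for JMX (which nodetool needs against remote nodes), 9042 for the native protocol, and 9160 for Thrift, and NTP needs UDP 123 through the network ACLs if the nodes sync with servers outside the subnet. A quick way to confirm the driver can actually see all three nodes over 9042 is to dump the cluster metadata; a minimal sketch against the 2.0 Java Driver, with a hypothetical contact-point address:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Host;

    public class NodeCheck {
        public static void main(String[] args) {
            // Hypothetical contact point; any one reachable node will do,
            // since the driver discovers the rest via gossip metadata.
            Cluster cluster = Cluster.builder().addContactPoint("10.0.0.10").build();
            cluster.connect();
            // Print every node the driver knows about and whether it is up;
            // a missing or down entry points at connectivity, not the app.
            for (Host host : cluster.getMetadata().getAllHosts()) {
                System.out.printf("%s dc=%s rack=%s up=%b%n",
                        host.getAddress(), host.getDatacenter(),
                        host.getRack(), host.isUp());
            }
            cluster.close();
        }
    }

If all three nodes show as up here but cqlsh against individual hosts still times out, that narrows the problem to the inter-node path (7000) rather than the client path (9042).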