Looks like the read timeouts were a result of a bug that will be fixed in 2.0.3.
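For anyone searching the archives for the same symptom: in our case the timeouts surfaced on the client side as read timeout exceptions from the driver, not as errors in the server logs. Below is a minimal sketch of catching and inspecting such a timeout with the 2.0.x Java driver; the contact points, keyspace, and table names are placeholders, not our actual configuration:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.exceptions.ReadTimeoutException;

    public class ReadTimeoutCheck {
        public static void main(String[] args) {
            // Placeholder node addresses and keyspace, for illustration only.
            Cluster cluster = Cluster.builder()
                    .addContactPoints("10.0.0.1", "10.0.0.2", "10.0.0.3")
                    .build();
            Session session = cluster.connect("my_keyspace");
            try {
                session.execute("SELECT * FROM my_table WHERE id = 1");
            } catch (ReadTimeoutException e) {
                // Report how many replicas answered versus how many the
                // consistency level required, and whether any data came back.
                System.err.printf("Read timed out at %s: %d/%d replicas responded, data retrieved: %b%n",
                        e.getConsistencyLevel(),
                        e.getReceivedAcknowledgements(),
                        e.getRequiredAcknowledgements(),
                        e.wasDataRetrieved());
            } finally {
                cluster.close(); // shutdown() on pre-2.0 driver versions
            }
        }
    }

The received/required counts are useful here because they distinguish a coordinator that heard from no replicas (suggesting inter-node trouble) from one that was simply slow.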
I found this question on the Datastax Java Driver mailing list:
https://groups.google.com/a/lists.datastax.com/forum/#!topic/java-driver-user/ao1ohSLpjRM
which led me to: https://issues.apache.org/jira/browse/CASSANDRA-6299

I built and deployed a 2.0.3 snapshot this morning, which includes this fix, and my cluster is now behaving normally (no read timeouts so far).

On Tue, Nov 19, 2013 at 4:55 PM, Steven A Robenalt <srobe...@stanford.edu> wrote:

> It seems that with NTP properly configured, the replication is now working as expected, but there are still a lot of read timeouts. The troubleshooting continues...
>
> On Tue, Nov 19, 2013 at 8:53 AM, Steven A Robenalt <srobe...@stanford.edu> wrote:
>
>> Thanks Michael, I will try that out.
>>
>> On Tue, Nov 19, 2013 at 5:28 AM, Laing, Michael <michael.la...@nytimes.com> wrote:
>>
>>> We had a similar problem when our nodes could not sync using ntp due to VPC ACL settings. -ml
>>>
>>> On Mon, Nov 18, 2013 at 8:49 PM, Steven A Robenalt <srobe...@stanford.edu> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am attempting to bring up our new app on a 3-node cluster and am having problems with frequent read timeouts and slow inter-node replication. Initially, these errors were mostly occurring in our app server, affecting 0.02%-1.0% of our queries in an otherwise unloaded cluster. No exceptions were logged on the servers in this case, and reads in a single-node environment with the same code and client driver virtually never see exceptions like this, so I suspect problems with the inter-node communication within the cluster.
>>>>
>>>> The 3 nodes are deployed in a single AWS VPC and are all in a common subnet. The Cassandra version is 2.0.2, following an upgrade this past weekend due to NPEs in a secondary index that were affecting certain queries under 2.0.1. The servers are m1.large instances running AWS Linux and Oracle JDK7u40. The first 2 nodes in the cluster are the seed nodes. All database contents are CQL tables with a replication factor of 3, and the application is Java-based, using the latest Datastax 2.0.0-rc1 Java Driver.
>>>>
>>>> In testing with the application, I noticed this afternoon that the contents of the 3 nodes differed in their respective copies of the same table for newly written data, for time periods exceeding several minutes, as reported by cqlsh on each node. Specifying different hosts from the same server using cqlsh also exhibited timeouts on multiple attempts to connect and on executing some queries, though they eventually succeeded in all cases, and eventually the data in all nodes was fully replicated.
>>>>
>>>> The AWS servers have a security group with only ports 22, 7000, 9042, and 9160 open.
>>>>
>>>> At this time, it seems that either I am still missing something in my cluster configuration, or maybe there are other ports that are needed for inter-node communication.
>>>>
>>>> Any advice/suggestions would be appreciated.
--
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobe...@stanford.edu
http://highwire.stanford.edu
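Regarding the slow replication described in the quoted thread: a minimal sketch (the node addresses, keyspace, table, and key below are hypothetical, not from our deployment) of one way to probe whether newly written data has reached all three replicas is to read the same row at consistency ONE and then ALL with the Java driver. An ALL read that fails while ONE succeeds points at replicas that have not caught up or cannot be reached on the inter-node port.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class ReplicationProbe {
        public static void main(String[] args) {
            // Placeholder node addresses and keyspace, for illustration only.
            Cluster cluster = Cluster.builder()
                    .addContactPoints("10.0.0.1", "10.0.0.2", "10.0.0.3")
                    .build();
            Session session = cluster.connect("my_keyspace");

            // With RF=3, a read at ALL must touch every replica; if it fails
            // while a read at ONE succeeds, the row exists but has not yet
            // reached all nodes, or a node is unreachable (e.g. on port 7000).
            String cql = "SELECT * FROM my_table WHERE id = 42"; // placeholder table/key
            for (ConsistencyLevel cl : new ConsistencyLevel[] { ConsistencyLevel.ONE, ConsistencyLevel.ALL }) {
                Statement stmt = new SimpleStatement(cql).setConsistencyLevel(cl);
                try {
                    ResultSet rs = session.execute(stmt);
                    System.out.printf("CL=%s: %s%n", cl, rs.one() != null ? "row found" : "no row");
                } catch (Exception e) {
                    System.out.printf("CL=%s: failed (%s)%n", cl, e.getClass().getSimpleName());
                }
            }
            cluster.close(); // shutdown() on pre-2.0 driver versions
        }
    }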