I've been getting client timeouts with Cassandra 1.1.8 in a 12-node cluster, but I don't see the same behavior with Cassandra 1.0.10. To narrow it down, I tried the following controlled experiment:
1. Started with Cassandra 1.0.10 and built a database with our test tools.
2. Ran the workload to confirm no timeout problems were seen, then stopped the load.
3. Upgraded only 2 of the 12 nodes in the cluster to 1.1.8, and ran scrub afterwards as the documentation states, to convert the sstables to the 1.1 format and fix level-manifest problems.
4. Started the load back up.
5. After some time, started seeing timeouts on the clients for requests that go to the 1.1.8 nodes (i.e. requests sent to those nodes as the coordinator node).

There appears to be a pattern to these timeouts: a large burst of them occurs every 10 minutes, on the 10-minute boundaries of the hour (10:10:XX, 10:20:YY, 10:30:ZZ, etc.); see the sketch at the end of this message. All clients see the timeouts from those two 1.1.8 nodes at exactly the same time. The workload is not I/O bound at this point, and requests are not being dropped either, based on tpstats output. I also don't see hinted handoff messages, which I checked because I believe hinted handoff runs every 10 minutes.

The key cache size is set to 2.7 GB and the memtable size is 1/3 of the heap (2.7 GB). Key cache memory usage is the same as on 1.0.10 based on the heap size calculator. There are no GC pauses or other heap-pressure messages in the logs. This is with Java 1.6.0_38.

Does anyone know of any periodic task in Cassandra 1.1 that runs every 10 minutes and could explain this problem, or have any other ideas?

Thanks
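P.S. In case it helps, below is a minimal sketch of how the 10-minute clustering can be confirmed: it buckets client timeout timestamps by minute of the hour. The timestamp format, the "TimeoutException" marker, and the client.log file name are assumptions for illustration only, not our actual client logging.

# Sketch: count client timeouts per minute-of-hour to confirm the
# 10-minute clustering. Timestamp format, "TimeoutException" marker and
# log path are assumed for illustration.
import re
from collections import Counter
from datetime import datetime

TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*TimeoutException")

def minute_histogram(log_path):
    """Return a Counter mapping minute-of-hour (0-59) to timeout count."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            m = TS_RE.search(line)
            if m:
                ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                counts[ts.minute] += 1
    return counts

if __name__ == "__main__":
    hist = minute_histogram("client.log")   # hypothetical client log path
    for minute in sorted(hist):
        print("minute %02d: %d timeouts" % (minute, hist[minute]))

If the bursts really land on the 10-minute boundaries, the counts spike at minutes 0, 10, 20, 30, 40 and 50.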