I've been getting client timeouts with Cassandra 1.1.8 in a 12-node cluster,
but I don't see the same behavior with Cassandra 1.0.10.
So, as a controlled experiment, I tried the following:

1. Started with Cassandra 1.0.10 and ran our test tools against it to build
a database
2. Ran the workload to ensure no timeout problems were seen, then stopped
the load
3. Upgraded only 2 of the 12 nodes in the cluster to 1.1.8. Ran scrub
afterwards, as the documentation instructs, to convert sstables to the 1.1
format and to fix level-manifest problems.
4. Started load back up
5. After some time, started seeing client timeouts for requests that go to
the 1.1.8 nodes (i.e. requests for which one of those nodes is the
coordinator)
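As a sanity check on step 5, the burst pattern can be confirmed by bucketing
the client timeout timestamps by minute-of-hour. Here is a minimal sketch of
that check (the timestamps below are made-up placeholders, not our real data):

```python
from collections import Counter
from datetime import datetime

# Placeholder timestamps (HH:MM:SS); in practice these would be parsed
# out of the client logs.
timeouts = ["10:10:03", "10:10:04", "10:20:11", "10:20:12", "10:30:07"]

# Bucket by minute-of-hour: a 10-minute pattern shows up as counts
# concentrated on minutes 0, 10, 20, 30, 40, 50.
buckets = Counter(datetime.strptime(t, "%H:%M:%S").minute for t in timeouts)
on_boundary = sum(count for minute, count in buckets.items() if minute % 10 == 0)
print("%d of %d timeouts fall on a 10-minute boundary" % (on_boundary, len(timeouts)))
```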

There appears to be a pattern in these timeouts: a large burst of them
occurs every 10 minutes, on the 10-minute boundaries of the hour (like
10:10:XX, 10:20:YY, 10:30:ZZ, etc.). All clients see the timeouts from those
two 1.1.8 nodes at exactly the same time. The workload is not I/O bound at
this point, and based on nodetool tpstats output, requests are not being
dropped either. I don't see hinted handoff messages either, though I believe
hint delivery runs every 10 minutes. The key cache size is set to 2.7GB and
the memtable size is 1/3 of the heap (2.7GB). The key cache memory usage is
the same as on 1.0.10 based on the heap size calculator. There are no GC
pauses or heap-pressure messages of any kind in the logs. This is with Java
1.6.0_38.
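To catch whatever fires on those boundaries in the act, one idea is to sample
the 1.1.8 nodes right at the 10-minute marks and compare against a
mid-interval baseline. A rough sketch of that sampler follows (the log path
is a placeholder, and it assumes nodetool is on the PATH):

```python
import subprocess
import time
from datetime import datetime

def near_boundary(now, window=5):
    # True within `window` seconds after a 10-minute mark of the hour.
    return now.minute % 10 == 0 and now.second < window

def monitor(logfile="/tmp/tpstats-boundary.log"):
    # Poll once a second; at each 10-minute boundary, append a
    # nodetool tpstats snapshot so it can be diffed against one
    # taken mid-interval.
    while True:
        if near_boundary(datetime.now()):
            snap = subprocess.run(["nodetool", "-h", "localhost", "tpstats"],
                                  capture_output=True, text=True).stdout
            with open(logfile, "a") as out:
                out.write("=== %s ===\n%s\n" % (datetime.now().isoformat(), snap))
            time.sleep(60)  # skip past this boundary before polling again
        time.sleep(1)

# Run monitor() on each 1.1.8 node while the workload is going.
```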

Does anyone know of a periodic task in Cassandra 1.1 that runs every 10
minutes and could explain this problem, or have any other ideas?

Thanks
