Hi all,

I am seeing a read delay in the small (3-node) cluster where I am testing. The "normal" latency for these operations is < 2ms as recorded by our load client, and this holds easily beyond several hundred qps. However, there are times when all incoming queries (on a node-by-node basis) are stalled for anywhere from ~100-500ms, and then all "clear" and return at the same time. This behavior is independent of the amount of load applied; just more queries get stalled at higher loads :). A "stall" seems to happen roughly every 30 seconds.
All of our column families use replication factor 3 on 3 nodes, and reads use consistency level ONE. The column family used for the test is not used elsewhere and is essentially empty.

We have verified at the network level that the request is received by the machine; however, the TRACE in Cassandra does not even start for several hundred milliseconds. Once tracing starts, all Cassandra trace events happen very fast (< 2ms), and once the query is complete, the response is sent promptly.

Here is a sample: 3 requests started at varying times, all returning at the same time (log from load client):

2014-01-02 20:32:12,752 [14] Start 12:32:12.612 End 12:32:12.752 Duration: 140.4811ms
2014-01-02 20:32:12,752 [12] Start 12:32:12.456 End 12:32:12.752 Duration: 296.5098ms
2014-01-02 20:32:12,752 [7] Start 12:32:12.316 End 12:32:12.752 Duration: 436.93ms

These three requests all hit the same server, which stalled before starting them. All TRACE events in Cassandra have the same timestamp, and align with the End timestamp above.

Could this be garbage collection? It seems odd that the delay never seems to happen in the middle of a request, only before it begins (at least according to the trace events).

I'm not sure how to proceed in troubleshooting this. Is any additional information needed, or does anyone have suggestions on how to troubleshoot?

Many thanks in advance!

Thunder

P.S. Install information:
Datastax Community Cassandra v2.0.2 (3 nodes)
Datastax .net client (latest)
Cluster is production and running a cluster-wide load of ~500rps read, ~200rps write
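
For anyone who wants to look for the same pattern in their own client logs, here is a minimal Python sketch (not the actual load client code) that groups slow requests by a shared End timestamp, which is the signature described above. The function name `find_stalls`, the 100ms threshold, and the exact-match grouping on the End string are assumptions made for illustration; the three sample lines are the ones quoted in this post.

```python
import re
from collections import defaultdict

# Matches the load-client log format quoted in this post, e.g.
# "2014-01-02 20:32:12,752 [14] Start 12:32:12.612 End 12:32:12.752 Duration: 140.4811ms"
LOG_LINE = re.compile(
    r"\[(?P<thread>\d+)\]\s+Start\s+(?P<start>[\d:.]+)\s+"
    r"End\s+(?P<end>[\d:.]+)\s+Duration:\s+(?P<ms>[\d.]+)ms"
)


def find_stalls(lines, threshold_ms=100.0):
    """Group slow requests by End timestamp (hypothetical helper).

    Returns {end_timestamp: [(thread, start, duration_ms), ...]} for every
    End timestamp shared by more than one request at or above threshold_ms.
    The threshold is an assumption taken from the ~100-500ms stall range.
    """
    by_end = defaultdict(list)
    for line in lines:
        m = LOG_LINE.search(line)
        if m is None:
            continue  # not a request line
        duration = float(m.group("ms"))
        if duration >= threshold_ms:
            by_end[m.group("end")].append(
                (int(m.group("thread")), m.group("start"), duration)
            )
    # A single slow request is not a stall; require several ending together.
    return {end: reqs for end, reqs in by_end.items() if len(reqs) > 1}


SAMPLE = [
    "2014-01-02 20:32:12,752 [14] Start 12:32:12.612 End 12:32:12.752 Duration: 140.4811ms",
    "2014-01-02 20:32:12,752 [12] Start 12:32:12.456 End 12:32:12.752 Duration: 296.5098ms",
    "2014-01-02 20:32:12,752 [7] Start 12:32:12.316 End 12:32:12.752 Duration: 436.93ms",
]

stalls = find_stalls(SAMPLE)
for end, reqs in stalls.items():
    # Start times share one fixed HH:MM:SS.mmm format, so the smallest
    # string is the earliest request caught in the stall (same-day logs only).
    earliest = min(start for _, start, _ in reqs)
    print(f"{len(reqs)} requests released together at {end}; earliest began {earliest}")
```

Comparing the earliest Start of each group against the node's GC log for the same window would be one way to confirm or rule out a garbage collection pause.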