Hi all,

I am seeing a read operation delay in our small (3 node) cluster where I am
testing. The "normal" latency for these operations is < 2ms as recorded by
our load client. This holds easily beyond several hundred qps. However
there are times when all incoming queries (on a node-by-node basis) are
stalled anywhere from ~100-500ms, and then all "clear" and return at the
same time. This behavior is independent of the amount of load applied; just
more queries get stalled at higher loads :). It seems like a "stall"
condition happens maybe every 30 seconds or so.

All of our column families are using replication factor 3 on 3 nodes, and
read consistency level ONE. The column family used for the test is not used
elsewhere, and is essentially empty.

We have verified at the network level that the request is "received" by the
machine; however, the TRACE in Cassandra does not even start for several
hundred milliseconds. All Cassandra trace events happen very fast (< 2ms).
Once the query is complete, the response is sent promptly.

Here is a sample:

3 requests started at varying times, all returning at the same time (log
from load client):
2014-01-02 20:32:12,752 [14]  Start 12:32:12.612 End 12:32:12.752 Duration: 140.4811ms
2014-01-02 20:32:12,752 [12]  Start 12:32:12.456 End 12:32:12.752 Duration: 296.5098ms
2014-01-02 20:32:12,752 [7]   Start 12:32:12.316 End 12:32:12.752 Duration: 436.93ms

These three requests each hit the same server, which stalled before starting
any of them. All TRACE events in Cassandra have the same timestamp, and
align with the END timestamp above.
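
For anyone who wants to look for the same pattern in their own client log,
here is a rough sketch of how the sample above can be grouped by End
timestamp to find a stall window. It assumes the line layout shown in the
sample; the function names and the thresholds (min_requests, min_ms) are
made up for illustration and are not part of any client library.

```python
import re
from collections import defaultdict
from datetime import datetime

# The three load-client lines from the sample above.
SAMPLE = """\
2014-01-02 20:32:12,752 [14]  Start 12:32:12.612 End 12:32:12.752 Duration: 140.4811ms
2014-01-02 20:32:12,752 [12]  Start 12:32:12.456 End 12:32:12.752 Duration: 296.5098ms
2014-01-02 20:32:12,752 [7]   Start 12:32:12.316 End 12:32:12.752 Duration: 436.93ms
"""

# \s+ also matches newlines, so entries wrapped by a mail client still parse.
LINE_RE = re.compile(
    r"\[(\d+)\]\s+Start\s+(\S+)\s+End\s+(\S+)\s+Duration:\s*([\d.]+)ms"
)


def parse(text):
    """Return one dict per request found in the load-client log text."""
    rows = []
    for m in LINE_RE.finditer(text):
        thread, start, end, dur = m.groups()
        rows.append({
            "thread": int(thread),
            "start": datetime.strptime(start, "%H:%M:%S.%f"),
            "end": datetime.strptime(end, "%H:%M:%S.%f"),
            "duration_ms": float(dur),
        })
    return rows


def stall_windows(rows, min_requests=2, min_ms=100.0):
    """Group slow requests that share an End time (hypothetical thresholds).

    Returns (earliest_start, end, request_count) tuples, one per suspected stall.
    """
    by_end = defaultdict(list)
    for r in rows:
        by_end[r["end"]].append(r)

    windows = []
    for end, group in sorted(by_end.items()):
        slow = [r for r in group if r["duration_ms"] >= min_ms]
        if len(slow) >= min_requests:
            first = min(r["start"] for r in slow)
            windows.append((first.time(), end.time(), len(slow)))
    return windows


if __name__ == "__main__":
    for first, end, count in stall_windows(parse(SAMPLE)):
        print(f"{count} requests stalled between {first} and {end}")
```

Grouping on the End time rather than the Start time is deliberate: the
requests begin at different moments but are all released together, so the
shared End is what identifies a stall.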

Could this be garbage collection? It seems odd that the "delay" never seems
to happen in the middle of a request, only before it begins (at least
according to the trace events). I'm not sure how to proceed in
troubleshooting this.
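
One way the GC question could be checked, as a sketch only: if the JVM is
started with HotSpot's -XX:+PrintGCApplicationStoppedTime, it logs a "Total
time for which application threads were stopped" line for each pause, and
those can be filtered for anything in the 100-500ms range. The sample lines
and their timestamp prefix below are hypothetical (invented values, not from
our nodes), and long_pauses is just an illustrative helper name.

```python
import re

# Matches HotSpot's stopped-time message; whatever precedes it on the line
# (date stamp, uptime) is kept verbatim as an opaque prefix, not parsed.
PAUSE_RE = re.compile(
    r"^(?P<prefix>.*?)Total time for which application threads were stopped: "
    r"(?P<secs>\d+\.\d+) seconds"
)


def long_pauses(lines, threshold_ms=100.0):
    """Return (prefix, pause_ms) for every pause at or above threshold_ms."""
    hits = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if not m:
            continue
        pause_ms = float(m.group("secs")) * 1000.0
        if pause_ms >= threshold_ms:
            hits.append((m.group("prefix").strip(), pause_ms))
    return hits


# Hypothetical sample lines; the values are invented for illustration only.
SAMPLE_GC_LOG = [
    "2014-01-02T20:32:12.750+0000: 1234.567: Total time for which "
    "application threads were stopped: 0.4361230 seconds",
    "2014-01-02T20:32:13.100+0000: 1234.917: Total time for which "
    "application threads were stopped: 0.0004120 seconds",
]

if __name__ == "__main__":
    for prefix, pause_ms in long_pauses(SAMPLE_GC_LOG):
        print(f"{prefix} paused {pause_ms:.1f} ms")
```

If long pauses show up at the same wall-clock times as the client-side
stalls, that would point at the JVM; if none do, the delay is more likely
happening before the request reaches Cassandra's request threads.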

Any additional information needed or suggestions on how to troubleshoot?
Many thanks in advance!
Thunder

P.S. Install information:
DataStax Community Cassandra v2.0.2 (3 nodes)
DataStax .NET client (latest)
Cluster is in production, running a cluster-wide load of ~500 rps read,
~200 rps write
