On 02/01/13 16:18, Tyler Hobbs wrote:
On Wed, Jan 2, 2013 at 5:28 AM, James Masson <james.mas...@opigram.com> wrote:
1) Hector sends a request to some node in the cluster, which will act as the coordinator.
2) The coordinator then sends the actual read requests out to each of the (RF) replicas.
3a) The coordinator waits for responses from the replicas; how many it waits for depends on the consistency level.
3b) The replicas perform the actual cache/memtable/sstable reads and respond to the coordinator when complete.
4) Once the required number of replicas have responded, the coordinator replies to the client (Hector).

The Read Request Latency metric measures the time taken in steps 2 through 4. The CF Local Read Latency metric captures only the time taken in step 3b.
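To make the difference between the two metrics concrete, here is a toy model of the steps above. It is a sketch under simplifying assumptions, not how Cassandra or Hector actually compute anything: the `ReplicaTiming` type, its millisecond fields and the helper functions are hypothetical; only the ONE/QUORUM/ALL levels are modelled; and digest requests, read repair and coordinator-side processing are ignored. Each replica's response reaches the coordinator after its network time plus its local read time, and the coordinator answers once the required number of responses has arrived.

```python
from dataclasses import dataclass


@dataclass
class ReplicaTiming:
    # Hypothetical per-replica timings, in milliseconds.
    network_ms: float     # time spent on the wire between coordinator and replica
    local_read_ms: float  # cache/memtable/sstable read on the replica (step 3b)


def required_responses(consistency_level: str, replication_factor: int) -> int:
    # Only a few consistency levels are modelled in this sketch.
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def coordinator_latency_ms(replicas, consistency_level):
    # Time at which each replica's response reaches the coordinator (steps 2 to 3b).
    arrivals = sorted(r.network_ms + r.local_read_ms for r in replicas)
    needed = required_responses(consistency_level, len(replicas))
    # The coordinator replies once the `needed`-th response has arrived (step 4).
    return arrivals[needed - 1]


def local_read_latencies_ms(replicas):
    # What each replica's local read metric would record: step 3b only.
    return [r.local_read_ms for r in replicas]


if __name__ == "__main__":
    replicas = [ReplicaTiming(1.0, 2.0), ReplicaTiming(0.5, 8.0), ReplicaTiming(1.5, 20.0)]
    for cl in ("ONE", "QUORUM", "ALL"):
        print(cl, coordinator_latency_ms(replicas, cl))
    print(local_read_latencies_ms(replicas))
```

With RF=3 this prints 3.0, 8.5 and 21.5 ms for ONE, QUORUM and ALL, while the local read figures stay at 2.0, 8.0 and 20.0 ms: the coordinator-side number depends on the consistency level and the network, the local one does not.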
Great, that's exactly the level of detail I'm looking for.
Is there anywhere I can find concrete definitions of what the stats in OpsCenter and the raw Cassandra metrics exposed via JMX mean? The docs I've found seem quite ambiguous.

This has pretty good writeups of each: http://www.datastax.com/docs/opscenter/online_help/performance/index#opscenter-performance-metrics
Your description above was much better :-) I'm more interested in docs for the raw metrics provided in JMX.
I still think that the data resolution OpsCenter gives makes it more suitable for trending/alerting than for chasing down tricky performance issues. This sort of investigation work is what I do for a living; I typically use intervals of 10 seconds or less, and don't average my data. Then again, storing your data inside the database you're measuring does restrict your options a little :-)

True, there's a limit to what you can detect with 60-second resolution. We've considered being able to report metrics at a finer resolution without durably storing them anywhere, which would be useful when you're actively watching the cluster.
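As an illustration of why averaging into 60-second points hides short problems, here is a small sketch. The latency values are made up for the example, and `downsample` is a hypothetical helper, not part of OpsCenter or Cassandra; it simply groups consecutive 10-second samples six at a time and reduces each group with a chosen aggregate.

```python
def downsample(samples, factor, agg=lambda chunk: sum(chunk) / len(chunk)):
    # Reduce each run of `factor` consecutive samples to one point
    # (e.g. six 10-second samples -> one 60-second point). Default: mean.
    return [agg(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# Hypothetical read latencies (ms) sampled every 10 seconds for two minutes,
# containing a single 10-second spike.
samples_10s = [5, 5, 5, 5, 5, 5, 5, 5, 65, 5, 5, 5]

print(max(samples_10s))                       # 65
print(downsample(samples_10s, 6))             # [5.0, 15.0]
print(downsample(samples_10s, 6, agg=max))    # [5, 65]
```

The 65 ms spike shows up as a modest 15 ms bump once averaged over a minute. Keeping the per-bucket maximum (or a high percentile) instead of the mean preserves it, at the cost of losing the typical value.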
That would be a great feature, but it's quite difficult to capture high-resolution data without disturbing the system you're trying to measure.
Perhaps worth taking the data-capture points off-list? James M