On Wed, Jan 2, 2013 at 5:28 AM, James Masson <james.mas...@opigram.com> wrote:

> thanks for clarifying this. So you're saying the difference between the
> global Read Request latency in opscenter, and the column family specific
> one is in the effort coordinating a validated read across multiple replicas?

Yes.

> Is this not part of what Hector does for itself?

No. Here's the basic order of events:

1) Hector sends a request to some node in the cluster, which will act as
   the coordinator.
2) The coordinator then sends the actual read requests out to each of the
   (RF) replicas.
3a) The coordinator waits for responses from the replicas; how many it
    waits for depends on the consistency level.
3b) The replicas perform actual cache/memtable/sstable reads and respond to
    the coordinator when complete.
4) Once the required number of replicas have responded, the coordinator
   replies to the client (Hector).

The Read Request Latency metric is measuring the time taken in steps 2
through 4. The CF Local Read Latency metric is only capturing the time
taken in step 3b.

> Essentially, I'm looking to see whether I can use this to derive where any
> extra latency from a client request comes from.

Yes, using the two numbers in conjunction can be very informative. Also,
you might be interested in the new query tracing feature in 1.2, which
shows very detailed steps and their latencies.

> As for names, I'd suggest "cluster coordinated read request latency", bit
> of a mouthful, I know.

Awesome, thanks for your input.

> Is there anywhere I can find concrete definitions of what the stats in
> OpsCenter, and raw Cassandra via JMX mean? The docs I've found seem quite
> ambiguous.

This has pretty good writeups of each:
http://www.datastax.com/docs/opscenter/online_help/performance/index#opscenter-performance-metrics

> I still think that the data resolution that OpsCenter gives makes it more
> suitable for trending/alerting rather than chasing down tricky performance
> issues.
> This sort of investigation work is what I do for a living, I
> typically use intervals of 10 seconds or lower, and don't average my data.
> Although, storing your data inside the database you're measuring does
> restrict your options a little :-)

True, there's a limit to what you can detect with 60 second resolution.
We've considered being able to report metrics at a finer resolution without
durably storing them anywhere, which would be useful for when you're
actively watching the cluster.

Thanks!

--
Tyler Hobbs
DataStax <http://datastax.com/>