On 31/12/12 18:45, Tyler Hobbs wrote:
On Mon, Dec 31, 2012 at 11:24 AM, James Masson <james.mas...@opigram.com
<mailto:james.mas...@opigram.com>> wrote:
Well, it turns out the Read Request Latency graph in OpsCenter is
highly misleading.
Using jconsole, the read-latency for the column family in question
is actually normally around 800 microseconds, punctuated by
occasional big spikes that drive up the averages.
Towards the end of the batch process, the OpsCenter-reported average
latency is up above 4000 microseconds, and forced compactions no longer
help drive the latency down again.
I'm going to stop relying on OpsCenter data for performance
analysis; it just doesn't have the resolution.
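The effect described above, where a few occasional spikes drive an averaged dashboard number far above what most reads actually experience, can be sketched with illustrative figures (the latency values below are hypothetical, not taken from the thread):

```python
# Sketch with made-up numbers: a handful of slow reads can pull the
# average latency well above what the typical request experiences.
import statistics

# 98 typical reads around 800 us, plus 2 spikes (illustrative values)
latencies_us = [800] * 98 + [80_000, 160_000]

mean_us = statistics.mean(latencies_us)      # what an averaged graph reports
median_us = statistics.median(latencies_us)  # what a typical read sees

print(f"median: {median_us} us")  # 800 us
print(f"mean:   {mean_us} us")    # 3184 us
```

This is why percentile or raw-sample views tend to be more useful than averages when chasing latency spikes.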
James, it's worth pointing out that Read Request Latency in OpsCenter is
measuring at the coordinator level, so it includes the time spent
sending requests to replicas and waiting for a response. There's
another latency metric that is per-column family named Local Read
Latency; it sounds like this is the equivalent number that you were
looking at in jconsole. This metric basically just includes the time to
read local caches/memtables/sstables.
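The gap between the two metrics can be sketched with a simple model (an assumed model for illustration, not Cassandra's actual internals): the coordinator has to wait until enough replicas respond, so its latency is roughly the k-th fastest of (network round trip + local read) across the replicas involved.

```python
# Minimal sketch (assumed model, not Cassandra internals): coordinator
# latency is the k-th fastest replica response, where each response is
# network round trip plus that replica's local read time.
def coordinator_latency_us(replica_local_us, network_rtt_us, replicas_needed):
    # per-replica response time = round trip + local read
    responses = sorted(l + network_rtt_us for l in replica_local_us)
    # the coordinator can answer once `replicas_needed` responses arrive
    return responses[replicas_needed - 1]

# Three replicas, one of them slow; a quorum read needs 2 responses.
# Even though the local read is ~800 us, the coordinator-level number
# also absorbs network time and waiting on other replicas.
result = coordinator_latency_us([800, 900, 15_000],
                                network_rtt_us=500,
                                replicas_needed=2)
print(result)  # 1400
```

Under this model, a single slow replica only shows up in the coordinator metric when the consistency level forces the coordinator to wait for it.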
We are looking to rename one or both of the metrics for clarity; any
input here would be helpful. For example, we're considering "Coordinated
Read Request Latency" or "Client Read Request Latency" in place of just
"Read Request Latency".
--
Tyler Hobbs
DataStax <http://datastax.com/>
Hi Tyler,
thanks for clarifying this. So you're saying the difference between the
global Read Request Latency in OpsCenter and the column-family-specific
one is the effort of coordinating a validated read across multiple
replicas? Is this not part of what Hector does for itself?
Essentially, I'm looking to see whether I can use this to derive where
any extra latency from a client request comes from.
As for names, I'd suggest "cluster coordinated read request latency",
a bit of a mouthful, I know.
Is there anywhere I can find concrete definitions of what the stats in
OpsCenter, and in raw Cassandra via JMX, actually mean? The docs I've
found seem quite ambiguous.
I still think that the data resolution OpsCenter gives makes it more
suitable for trending/alerting than for chasing down tricky performance
issues. This sort of investigation work is what I do for a living; I
typically use intervals of 10 seconds or lower, and don't average my
data. Although storing your data inside the database you're measuring
does restrict your options a little :-)
regards
James M