Hello Ashish,
I don't think Cassandra exposes any metrics like that via the JMX
interface (which is where the Prometheus JMX exporter gets its metrics
from). However, you do have a few other options to achieve the same
goal, such as request tracing (nodetool settraceprobability), the slow
query log (slow_query_log_timeout_in_ms in the cassandra.yaml, but be
mindful that it can be misleading, as all queries in flight during a
long STW GC pause will be logged as slow) and the new full query log
feature in Cassandra 4 (full_query_logging_options in the
cassandra.yaml).
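For example (the exact values below are only illustrative, pick ones
that suit your workload and Cassandra version):

    # trace roughly 0.1% of requests on this node
    nodetool settraceprobability 0.001

    # in cassandra.yaml: log queries slower than 500 ms
    slow_query_log_timeout_in_ms: 500

    # Cassandra 4: turn on the full query log at runtime
    # (or configure full_query_logging_options in cassandra.yaml)
    nodetool enablefullquerylog --path /path/to/fql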
But, to be honest, I don't think a single read on one 80MB partition
will cause any GC issue at all. It's more likely to be a range query
(e.g.: SELECT ... FROM ... WHERE TOKEN(pk) > m AND TOKEN(pk) < n),
repeated reads of the same partition in a very short period of time
(e.g.: a bad retry policy, or hot partitions), very bursty requests
(where the peak has exceeded the node's capacity), or a large number
of tombstones (check the logs, see the example below). Or, more often,
a combination of those.
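If you suspect tombstones, the warnings Cassandra writes to its system
log are a quick way to confirm it, for example (the log path may
differ on your installation):

    grep -i tombstone /var/log/cassandra/system.log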
I'm quite interested to find out how you know it's table 'x' that is
responsible for the long GC pauses. Have you got some concrete
evidence for that? If you aren't sure, you may want to keep an open
mind, as the root cause could be something else, such as repair
sessions (merkle tree size) or hinted handoff (malformed writes can
get stuck in hinted handoff and be retried repeatedly until they
expire, at least this was true in early 3.x versions).
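A few nodetool commands are a cheap first check to rule those in or
out, for example (the output differs a bit between versions):

    nodetool compactionstats  # validation compactions => repair running
    nodetool netstats         # streaming sessions, read repair stats
    nodetool tpstats          # pending/blocked stages, incl. hints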
If you want to dig deep into the root cause, I personally like the
approach of taking and analysing a snapshot of the JVM heap during the
long GC pause, provided the pause is long enough (a few seconds should
be sufficient). You can write a script that reads the GC log file and
takes a heap dump when the JVM is in a long STW GC, and you will then
be able to see exactly what is in the heap when it happens. I find the
heap dump often gives me very useful insight into the exact cause of
the long GC pause.
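To illustrate the idea, here is a minimal sketch of such a script. It
assumes Java 8 style GC logs with -XX:+PrintGCApplicationStoppedTime
enabled and GNU grep; the log path, the 3 second threshold and the way
the Cassandra PID is found are all assumptions you would need to adapt:

    #!/usr/bin/env bash
    # Watch the GC log, dump the heap after an unusually long STW pause.
    GC_LOG=/var/log/cassandra/gc.log   # assumed path
    THRESHOLD=3.0                      # seconds, pick your own
    PID=$(pgrep -f CassandraDaemon)    # assumes one Cassandra process

    tail -F "$GC_LOG" | while read -r line; do
        # matches "Total time for which application threads were stopped: N seconds"
        secs=$(echo "$line" | grep -oP 'stopped: \K[0-9.]+')
        if [ -n "$secs" ] && awk -v s="$secs" -v t="$THRESHOLD" 'BEGIN { exit !(s >= t) }'; then
            jmap -dump:format=b,file=/tmp/cassandra-heap-$(date +%s).hprof "$PID"
        fi
    done

Bear in mind the heap dump itself is disruptive (it pauses the JVM and
writes a multi-GB file), so you probably only want this running while
you are actively investigating.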
Additionally, or alternatively, if you are using ZFS for the Cassandra
data directory like I do, the ZFS debug log can give you a lot more
insight into what exactly Cassandra had read from or written to the
filesystem, down to the specific filename and offset, and from there
you will be able to reconstruct what happened to Cassandra right
before a long GC pause.
Happy GC issue hunting and perhaps GC tuning too :-)
Cheers,
Bowen
On 11/09/2021 16:39, MyWorld wrote:
Hi all,
We are using Prometheus + Grafana for monitoring Apache Cassandra with
a scrape interval of 15s. We have a table 'x' with partition sizes
varying from 2MB to 80MB.
We know there are a few big partition entries present in this table,
and my objective is to monitor when such a big partition is read from
Cassandra (as it can be a cause of a large GC pause).
Now, in Prometheus, how can I figure out the "size of total data read"
from table 'x' in the last 15s? What formula can be applied?
Regards,
Ashish