In my experience, slow read queries combined with low CPU usage often
point to poor table schema design (e.g. large partitions) or poorly
formed queries (e.g. queries that do not restrict the partition key).

Check the Cassandra logs first. Are there any long stop-the-world GC
pauses? Any tombstone warnings? Anything else out of the ordinary?

Then check the output of "nodetool tpstats". Are there any pending or
blocked tasks? Which thread pool(s) are they in? Is there a high number
of dropped messages?

If you can't find anything useful in the Cassandra server logs or
"nodetool tpstats", take a few slow queries from your application's
log and run them manually in cqlsh. Are the results very large? How
long do they take?
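
If you want to automate the "nodetool tpstats" check, here is a rough
Python sketch that lists the thread pools with pending or blocked tasks.
It assumes the usual column layout of the pool table (Pool Name, Active,
Pending, Completed, Blocked, All time blocked), so please compare it
against the output of your own version first. The function names are
made up for illustration.

```python
import subprocess


def read_tpstats():
    """Run `nodetool tpstats` locally and return its text output."""
    result = subprocess.run(
        ["nodetool", "tpstats"], capture_output=True, text=True, check=True
    )
    return result.stdout


def find_backlogged_pools(tpstats_text):
    """Return {pool_name: (pending, blocked)} for pools with a backlog.

    Assumption: pool rows have six whitespace-separated columns in the
    order Pool Name, Active, Pending, Completed, Blocked, All time blocked.
    """
    backlog = {}
    in_pool_table = False
    for line in tpstats_text.splitlines():
        fields = line.split()
        if line.startswith("Pool Name"):
            in_pool_table = True  # header row of the thread pool table
            continue
        if not fields:
            in_pool_table = False  # a blank line ends the pool table
            continue
        if not in_pool_table or len(fields) < 6:
            continue
        try:
            pending = int(fields[2])
            blocked = int(fields[4])
        except ValueError:
            continue  # skip rows that do not match the assumed layout
        if pending > 0 or blocked > 0:
            backlog[fields[0]] = (pending, blocked)
    return backlog
```

On a node you would call find_backlogged_pools(read_tpstats()); an empty
result means no pool currently has pending or blocked tasks.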
Regarding some of your observations:
> CPU load is around 20-25% - so we have lots of spare capacity
Is it a small number of threads, each using nearly 100% of a CPU core?
If so, what are those threads? (I find the ttop command from the sjk
tool <https://github.com/aragozin/jvm-tools> very helpful.)
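
To illustrate why a low overall percentage can still hide saturated
threads, here is a back-of-the-envelope calculation using the figures
from your mail (6 cores, 20-25% load). The helper name is just for
illustration.

```python
def busy_core_equivalents(total_cpu_percent, core_count):
    """Convert an overall CPU utilisation figure into 'fully busy cores'."""
    return total_cpu_percent / 100.0 * core_count


for pct in (20, 25):
    cores = busy_core_equivalents(pct, 6)
    print(f"{pct}% of 6 cores = {cores:.2f} cores' worth of work")
```

So 20-25% overall is the same amount of CPU time as one or two threads
pinned at 100% while everything else sits idle, which the averaged
figure alone cannot distinguish from evenly spread light load.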
> network load is around 50% of the full available bandwidth
This sounds alarming to me. May I ask what the full available
bandwidth is? Do you have a lot of CPU time spent in sys (vs user) mode?
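
If your monitoring does not break CPU time down by mode, a minimal
Python sketch like the one below can do it from two samples of the
aggregate "cpu" line in /proc/stat on Linux. The function names are
hypothetical, and it assumes the first eight counters on that line are
user, nice, system, idle, iowait, irq, softirq and steal.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line (the first line of /proc/stat)."""
    with open(path) as f:
        return f.readline()


def parse_cpu_line(line):
    """Parse the aggregate cpu line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) < len(FIELDS):
        raise ValueError("fewer counters than assumed")
    return dict(zip(FIELDS, values))


def cpu_mode_shares(before_line, after_line):
    """Return the percentage of elapsed CPU time spent in each mode."""
    before = parse_cpu_line(before_line)
    after = parse_cpu_line(after_line)
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * ticks / total for name, ticks in delta.items()}
```

Take one sample with read_cpu_line(), wait a few seconds under load,
take another, and pass both to cpu_mode_shares(). A "system" or
"softirq" share that is large compared to "user" would suggest the
kernel (e.g. network handling) is doing most of the work.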
On 05/03/2021 14:48, Attila Wind wrote:
Hi guys,
I have a DevOps related question - hope someone here could give some
ideas/pointers...
We are running a 3-node Cassandra cluster.
Recently we realized we have performance issues, and based on the
investigation we did, it seems our bottleneck is the Cassandra
cluster. The application layer is waiting a lot for Cassandra ops. So
queries are running slowly on the Cassandra side; however, according
to our monitoring, the Cassandra servers still seem to have lots of
free resources...
The Cassandra machines are virtual machines (we do own the physical
hosts too) built with kvm - with 6 CPU cores (3 physical) and 32GB RAM
dedicated to it.
We are using Ubuntu Linux 18.04 distro - everywhere the same version
(the physical and virtual host)
We are running Cassandra 4.0-alpha4
What we see is
* CPU load is around 20-25% - so we have lots of spare capacity
* iowait is around 2-5% - so disk bandwidth should be fine
* network load is around 50% of the full available bandwidth
* loadavg is max around 4 - 4.5 but typically around 3 (with 6 CPU
cores, a loadavg of 6 would represent 100% load)
and still, query performance is slow ... and we do not understand what
could be holding Cassandra back from fully utilizing the server resources...
We are clearly missing something!
Does anyone have any ideas or tips?
thanks!
--
Attila Wind
http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932