Thanks for the answers @Sean and @Bowen !!!
First of all, this article describes something very similar to what we are experiencing - let me share it:
https://www.senticore.com/overcoming-cassandra-write-performance-problems/
We are studying it now.
Furthermore:
* yes, we do have some level of unbalanced data which needs to be improved - this is on our backlog, so it will get done
* and yes, we can clearly see that this unbalanced data is slowing everything down in Cassandra (our Prometheus+Grafana based monitoring proves it)
* we will definitely do this optimization now (luckily we already have a plan)
@Sean:
* "Since these are VMs, is there any chance they are competing for
resources on the same physical host?"
We are splitting the physical hardware into 2 VMs - and resources
(cpu cores, disks, ram) all assigned in a dedicated fashion to the
VMs without intersection
BUT!!
You are right... There is one thing we are sharing: network
bandwidth... and actually that one does not come up in the "iowait"
part for sure. We will further analyze into this direction
definitely because from the monitoring as far as I see yeppp, we
might hit the wall here
* consistency level: we are using LOCAL_ONE
* "Does the app use prepared statements that are only prepared once
per app invocation?"
Yes and yes :-)
* "Any LWT/”if exists” in your code?"
No. We go with RF=2 so we even can not use this (as LWT goes with
QUORUM and in our case this would mean we could not tolerate losing
a node... not good... so no)
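For reference, this is a minimal sketch of the "prepare once at startup, bind per request" pattern we follow, assuming the DataStax Java driver 4.x and with LOCAL_ONE set on the bound statement. The contact point, datacenter, keyspace and table names are made up for illustration, not our real schema:

    // Minimal sketch only - hypothetical node address, DC, keyspace and table.
    import com.datastax.oss.driver.api.core.ConsistencyLevel;
    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.BoundStatement;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;
    import java.net.InetSocketAddress;
    import java.util.UUID;

    public class PrepareOnceSketch {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder()
                    .addContactPoint(new InetSocketAddress("10.0.0.1", 9042)) // hypothetical node
                    .withLocalDatacenter("dc1")                               // hypothetical DC name
                    .build()) {

                // Prepared exactly once per application start-up...
                PreparedStatement insert = session.prepare(
                        "INSERT INTO my_ks.events (id, payload) VALUES (?, ?)"); // hypothetical table

                // ...then only bound and executed per request, at LOCAL_ONE.
                for (int i = 0; i < 3; i++) {
                    BoundStatement bound = insert
                            .bind(UUID.randomUUID(), "payload-" + i)
                            .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
                    session.execute(bound);
                }
            }
        }
    }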
@Bowen:
* The bandwidth limit is 1 Gbit/sec (so roughly 120 MB/sec), BUT that is the limit of the physical host - so our 2 VMs are competing for it. The Cassandra VM possibly gets ~50-70% of it...
* The CPU's "system" value shows 8-12%
* "nodetool tpstats"
whooa I never used it, we definitely need some learning here to even
understand the output... :-) But I copy that here to the bottom ...
maybe clearly shows something to someone who can read it...
so, "nodetool tpstats" from one of the nodes
Pool Name                        Active  Pending  Completed  Blocked  All time blocked
ReadStage                             0        0    6248406        0                 0
CompactionExecutor                    0        0     168525        0                 0
MutationStage                         0        0   25116817        0                 0
MemtableReclaimMemory                 0        0      17636        0                 0
PendingRangeCalculator                0        0          7        0                 0
GossipStage                           0        0     324388        0                 0
SecondaryIndexManagement              0        0          0        0                 0
HintsDispatcher                       1        0         75        0                 0
Repair-Task                           0        0          1        0                 0
RequestResponseStage                  0        0   31186150        0                 0
Native-Transport-Requests             0        0   22827219        0                 0
CounterMutationStage                  0        0   12560992        0                 0
MemtablePostFlush                     0        0      19259        0                 0
PerDiskMemtableFlushWriter_0          0        0      17636        0                 0
ValidationExecutor                    0        0         48        0                 0
Sampler                               0        0          0        0                 0
ViewBuildExecutor                     0        0          0        0                 0
MemtableFlushWriter                   0        0      17636        0                 0
InternalResponseStage                 0        0      44658        0                 0
AntiEntropyStage                      0        0        161        0                 0
CacheCleanupExecutor                  0        0          0        0                 0
Message type             Dropped  Latency waiting in queue (micros)
                                        50%        95%         99%          Max
READ_RSP                      18    1629.72    8409.01   155469.30    386857.37
RANGE_REQ                      0       0.00       0.00        0.00         0.00
PING_REQ                       0       0.00       0.00        0.00         0.00
_SAMPLE                        0       0.00       0.00        0.00         0.00
VALIDATION_RSP                 0       0.00       0.00        0.00         0.00
SCHEMA_PULL_RSP                0       0.00       0.00        0.00         0.00
SYNC_RSP                       0       0.00       0.00        0.00         0.00
SCHEMA_VERSION_REQ             0       0.00       0.00        0.00         0.00
HINT_RSP                       0     943.13    3379.39     5839.59     52066.35
BATCH_REMOVE_RSP               0       0.00       0.00        0.00         0.00
PAXOS_COMMIT_REQ               0       0.00       0.00        0.00         0.00
SNAPSHOT_RSP                   0       0.00       0.00        0.00         0.00
COUNTER_MUTATION_REQ          94    1358.10    5839.59    14530.76    464228.84
GOSSIP_DIGEST_SYN              0    1358.10    5839.59    25109.16     25109.16
PAXOS_PREPARE_REQ              0       0.00       0.00        0.00         0.00
PREPARE_MSG                    0       0.00       0.00        0.00         0.00
PAXOS_COMMIT_RSP               0       0.00       0.00        0.00         0.00
HINT_REQ                       0       0.00       0.00        0.00         0.00
BATCH_REMOVE_REQ               0       0.00       0.00        0.00         0.00
STATUS_RSP                     0       0.00       0.00        0.00         0.00
READ_REPAIR_RSP                0       0.00       0.00        0.00         0.00
GOSSIP_DIGEST_ACK2             0    1131.75    5839.59     7007.51      7007.51
CLEANUP_MSG                    0       0.00       0.00        0.00         0.00
REQUEST_RSP                    0       0.00       0.00        0.00         0.00
TRUNCATE_RSP                   0       0.00       0.00        0.00         0.00
REPLICATION_DONE_RSP           0       0.00       0.00        0.00         0.00
SNAPSHOT_REQ                   0       0.00       0.00        0.00         0.00
ECHO_REQ                       0       0.00       0.00        0.00         0.00
PREPARE_CONSISTENT_REQ         0       0.00       0.00        0.00         0.00
FAILURE_RSP                    9       0.00       0.00        0.00         0.00
BATCH_STORE_RSP                0       0.00       0.00        0.00         0.00
SCHEMA_PUSH_RSP                0       0.00       0.00        0.00         0.00
MUTATION_RSP                  17    1131.75    4866.32     8409.01    464228.84
FINALIZE_PROPOSE_MSG           0       0.00       0.00        0.00         0.00
ECHO_RSP                       0       0.00       0.00        0.00         0.00
INTERNAL_RSP                   0       0.00       0.00        0.00         0.00
FAILED_SESSION_MSG             0       0.00       0.00        0.00         0.00
_TRACE                         0       0.00       0.00        0.00         0.00
SCHEMA_VERSION_RSP             0       0.00       0.00        0.00         0.00
FINALIZE_COMMIT_MSG            0       0.00       0.00        0.00         0.00
SNAPSHOT_MSG                   0       0.00       0.00        0.00         0.00
PREPARE_CONSISTENT_RSP         0       0.00       0.00        0.00         0.00
PAXOS_PROPOSE_REQ              0       0.00       0.00        0.00         0.00
PAXOS_PREPARE_RSP              0       0.00       0.00        0.00         0.00
MUTATION_REQ                 265    1358.10    5839.59   223875.79    802187.44
READ_REQ                      45    1629.72    5839.59    36157.19    386857.37
PING_RSP                       0       0.00       0.00        0.00         0.00
RANGE_RSP                      0       0.00       0.00        0.00         0.00
VALIDATION_REQ                 0       0.00       0.00        0.00         0.00
SYNC_REQ                       0       0.00       0.00        0.00         0.00
_TEST_1                        0       0.00       0.00        0.00         0.00
GOSSIP_SHUTDOWN                0       0.00       0.00        0.00         0.00
TRUNCATE_REQ                   0       0.00       0.00        0.00         0.00
_TEST_2                        0       0.00       0.00        0.00         0.00
GOSSIP_DIGEST_ACK              0    1629.72    5839.59    43388.63     43388.63
SCHEMA_PUSH_REQ                0       0.00       0.00        0.00         0.00
FINALIZE_PROMISE_MSG           0       0.00       0.00        0.00         0.00
BATCH_STORE_REQ                0       0.00       0.00        0.00         0.00
COUNTER_MUTATION_RSP          96    1358.10    4866.32     8409.01    464228.84
REPAIR_RSP                     0       0.00       0.00        0.00         0.00
STATUS_REQ                     0       0.00       0.00        0.00         0.00
SCHEMA_PULL_REQ                0       0.00       0.00        0.00         0.00
READ_REPAIR_REQ                0       0.00       0.00        0.00         0.00
ASYMMETRIC_SYNC_REQ            0       0.00       0.00        0.00         0.00
REPLICATION_DONE_REQ           0       0.00       0.00        0.00         0.00
PAXOS_PROPOSE_RSP              0       0.00       0.00        0.00         0.00
Attila Wind
http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932
On 05.03.2021 17:45, Bowen Song wrote:
Based on my personal experience, the combination of slow read queries and low CPU usage is often an indicator of bad table schema design (e.g. large partitions) or bad queries (e.g. without a partition key).
Check the Cassandra logs first: is there any long stop-the-world GC? Tombstone warnings? Anything else out of the ordinary? Check the output of "nodetool tpstats": are there any pending or blocked tasks? Which thread pool(s) are they in? Is there a high number of dropped messages? If you can't find anything useful in the Cassandra server logs and "nodetool tpstats", try to get a few slow queries from your application's log and run them manually in cqlsh. Are the results very large? How long do they take?
Regarding some of your observations:
> CPU load is around 20-25% - so we have lots of spare capacity
Is it a small number of threads each using nearly 100% of a CPU core? If so, what are those threads? (I find the ttop command from the sjk tool <https://github.com/aragozin/jvm-tools> very helpful)
> network load is around 50% of the full available bandwidth
This sounds alarming to me. May I ask what the full available bandwidth is? Do you have a lot of CPU time spent in sys (vs user) mode?
On 05/03/2021 14:48, Attila Wind wrote:
Hi guys,
I have a DevOps-related question - I hope someone here can give some ideas/pointers...
We are running a 3-node Cassandra cluster.
Recently we realized we have performance issues. Based on the investigation we did, it seems our bottleneck is the Cassandra cluster: the application layer is waiting a lot for Cassandra operations. So queries are running slowly on the Cassandra side, yet according to our monitoring the Cassandra servers still appear to have lots of free resources...
The Cassandra machines are virtual machines (we own the physical hosts too) built with KVM, with 6 CPU cores (3 physical) and 32 GB RAM dedicated to each.
We are using the Ubuntu Linux 18.04 distro - the same version everywhere (physical and virtual hosts).
We are running Cassandra 4.0-alpha4
What we see is
* CPU load is around 20-25% - so we have lots of spare capacity
* iowait is around 2-5% - so disk bandwidth should be fine
* network load is around 50% of the full available bandwidth
* loadavg is max around 4-4.5 but typically around 3 (with 6 CPU cores, a loadavg of 6 would represent 100% load)
and still, query performance is slow... and we do not understand what could be holding Cassandra back from fully utilizing the server resources...
We are clearly missing something!
Does anyone have any ideas / tips?
thanks!
--
Attila Wind
http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932