Hello Jeff,
I'm not a consultant, but I do have some experience troubleshooting
this type of issue.
The first thing in troubleshooting is gathering information. You don't
want to troubleshoot issues blindly.
Some (but not all) of the important information to collect: CPU usage,
network IO, disk IO, JVM heap usage, Cassandra query latency, queries/s,
dropped messages, pending compactions, GC logs, Cassandra logs and
system logs.
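As a rough sketch of how to capture the Cassandra-side items from the list above in one go, something like the following could be run on an affected node while a repair is in progress. The nodetool subcommands are standard ones, but the labels and the collect_snapshot helper are made up for illustration, and OS-level metrics (CPU, disk IO, network IO) and the log files still need to be collected separately.

```python
import shutil
import subprocess

# Standard nodetool subcommands; the labels on the left are only illustrative.
NODETOOL_COMMANDS = {
    "dropped_messages_and_thread_pools": ["nodetool", "tpstats"],
    "pending_compactions": ["nodetool", "compactionstats"],
    "query_latency": ["nodetool", "proxyhistograms"],
    "gc_pauses": ["nodetool", "gcstats"],
    "heap_usage_and_load": ["nodetool", "info"],
}


def collect_snapshot(commands=NODETOOL_COMMANDS, run=None):
    """Run each command and return a dict of {label: output text}.

    `run` is a hypothetical hook: any callable taking an argv list and
    returning text. By default the command is executed via subprocess.
    """
    if run is None:
        def run(argv):
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=60
            )
            return result.stdout
    return {label: run(argv) for label, argv in commands.items()}


if __name__ == "__main__":
    if shutil.which("nodetool") is None:
        print("nodetool not found on PATH; run this on a Cassandra node")
    else:
        for label, output in collect_snapshot().items():
            print(f"===== {label} =====")
            print(output)
```

Taking one snapshot before the repair starts and another while the node is struggling makes it much easier to see what actually changed.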
Also, how is the repair run? Is it a subrange repair? Is it an incremental
repair? On Cassandra 3.0.x and 3.x, it's recommended to do subrange full
(non-incremental) repairs, because incremental repair before Cassandra
4.0 has known issues and can cause excessive anti-compaction. If the
cluster has ever run an incremental repair, there are some extra steps
needed to switch to full repairs. Skipping these extra steps will leave
the previously repaired but now outdated data permanently on all nodes,
which will not only waste disk space, but also slow down queries and
increase GC pressure.
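To illustrate what a subrange full repair looks like in practice, here is a minimal sketch that splits one token range into smaller pieces and prints a nodetool repair command for each, using the -full, -st and -et options. The keyspace name and the token values are placeholders. The real ranges have to come from the cluster (for example from nodetool describering), and as far as I know each subrange must fall within a range that the node you run it on actually replicates.

```python
def split_range(start, end, parts):
    """Split the token range (start, end] into `parts` contiguous subranges."""
    width = end - start
    if parts < 1 or width < parts:
        raise ValueError("need 1 <= parts <= (end - start)")
    # Integer arithmetic keeps the boundaries exact even for 64-bit tokens.
    bounds = [start + width * i // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def repair_commands(keyspace, start, end, parts):
    """Build one full (non-incremental) subrange repair command per piece."""
    return [
        f"nodetool repair -full -st {st} -et {et} {keyspace}"
        for st, et in split_range(start, end, parts)
    ]


if __name__ == "__main__":
    # Placeholder keyspace and tokens, for illustration only.
    for cmd in repair_commands("my_keyspace", 0, 1000, 4):
        print(cmd)
```

Running the commands one at a time, with a pause in between if needed, keeps each repair session small, which limits the streaming and validation compaction load a single node has to absorb at once.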
Cheers,
Bowen
On 27/09/2024 01:33, Jeff Masud wrote:
I'm hoping someone can recommend a good Cassandra consultant.
We have a 12-node cluster spanning 2 data centers. When doing
repairs, a node will spike and become unresponsive or completely die. I'm
assuming it's related to very high GC times.
We're currently running 3.0.30 and looking to upgrade to a newer
version once we can get a repair to complete successfully.
Please reach out to me directly.
Thanks
Jeff
--
Jeff Masud
Deasil Works
818-945-0821 x107
310-918-5333 Mobile
jeff@deasil.works