Hello Jeff,

I'm not a consultant, but I do have some experience troubleshooting this type of issue.

The first thing in troubleshooting is gathering information. You don't want to troubleshoot issues blindly.

Some (but not all) of the important information to collect: CPU usage, network I/O, disk I/O, JVM heap usage, Cassandra query latency, queries per second, dropped messages, pending compactions, GC logs, Cassandra logs and system logs.
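As a rough sketch of how part of that list could be collected on one node, the Python snippet below runs a handful of nodetool subcommands and keeps their raw output. It assumes nodetool is on the PATH of the user running it; the names NODETOOL_CHECKS, run_nodetool and collect_snapshot are made up for this example, and it does not cover OS-level metrics or log files.

```python
import subprocess

# nodetool subcommands covering part of the list above.
# Assumption: `nodetool` is on PATH and can reach the local node.
NODETOOL_CHECKS = {
    "thread pools and dropped messages": ["tpstats"],
    "pending compactions": ["compactionstats"],
    "coordinator latency": ["proxyhistograms"],
    "GC statistics": ["gcstats"],
    "heap usage and load": ["info"],
}


def run_nodetool(args):
    """Run one nodetool subcommand and return its output, or an error note."""
    try:
        result = subprocess.run(
            ["nodetool", *args], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return "FAILED: nodetool not found on PATH"
    if result.returncode != 0:
        return "FAILED: " + result.stderr.strip()
    return result.stdout


def collect_snapshot(runner=run_nodetool):
    """Return a dict mapping each check label to the raw command output."""
    return {label: runner(args) for label, args in NODETOOL_CHECKS.items()}
```

Calling collect_snapshot() before, during and after a repair and comparing the outputs gives a first idea of where the pressure builds up. The runner parameter only exists so the function can be exercised without a live node.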

Also, how is the repair run? Is it a subrange repair? Is it an incremental repair? On Cassandra 3.0.x and 3.x, subrange full (non-incremental) repairs are recommended, because incremental repair before Cassandra 4.0 has known issues and can cause excessive anti-compaction. If the cluster has ever run an incremental repair, there are some extra steps needed to switch to full repairs. Skipping these steps will leave the previously repaired but now outdated data permanently on all nodes, which not only wastes disk space but also slows down queries and increases GC pressure.
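To illustrate what a subrange full repair looks like in practice, here is a minimal Python sketch that splits one token range into smaller pieces and builds a `nodetool repair -full -st <start> -et <end>` command for each piece. The function names and the example keyspace are hypothetical. It assumes the range passed in is a single non-wrapping range that the node actually replicates; how those ranges are discovered, and how the commands are scheduled, is left out.

```python
def split_token_range(start, end, parts):
    """Split the token range (start, end] into up to `parts` contiguous subranges."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    if start >= end:
        # Assumption: wrap-around ranges are out of scope for this sketch.
        raise ValueError("wrap-around ranges are not handled in this sketch")
    width = end - start
    bounds = [start + width * i // parts for i in range(parts)] + [end]
    # Drop empty pieces, which appear when `parts` exceeds the range width.
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


def repair_commands(keyspace, start, end, parts):
    """Build one full (non-incremental) subrange repair command per piece."""
    return [
        ["nodetool", "repair", "-full", "-st", str(lo), "-et", str(hi), keyspace]
        for lo, hi in split_token_range(start, end, parts)
    ]
```

For example, repair_commands("my_keyspace", 0, 100, 4) yields four commands covering (0, 25], (25, 50], (50, 75] and (75, 100]. Running the pieces one at a time keeps each repair session small, which is the main point of subrange repair.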

Cheers,
Bowen

On 27/09/2024 01:33, Jeff Masud wrote:

I'm hoping someone can recommend a good Cassandra consultant.

We have a 12-node cluster spanning 2 data centers. When doing repairs, a node will spike and become unresponsive or die completely. I’m assuming it’s related to very high GC times.

We're currently running 3.0.30 and looking to upgrade to a newer version once we can get a repair to complete successfully.

Please reach out to me directly.

Thanks

Jeff

--

Jeff Masud

Deasil Works

818-945-0821 x107

310-918-5333 Mobile

jeff@deasil.works
