Hello. I had the same issues with full repair. I tried various GC settings; the most performant was ZGC on Java 11, but I had some stability issues with it. I kept the G1GC settings from 3.11.x and got the same issues as you: CPU load over 90% and a growing count of open file descriptors (up to the maximum allowed). It looks like the repair job doesn't complete the repair of individual segments and instead waits until all segments are repaired, so the job takes more and more resources until either the node becomes unresponsive or all segments have been repaired.
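If you want to watch the file descriptor growth while a repair runs, something like the following might help. It is only a minimal sketch, assuming a Linux host with /proc mounted; the function names are hypothetical helpers of my own, not part of Cassandra or nodetool, and reading another user's process normally requires running as that user or as root.

```python
import os


def count_open_fds(pid="self"):
    """Count open file descriptors of a process via /proc (Linux only)."""
    # Each entry in /proc/<pid>/fd is one open descriptor.
    # For pid="self" the count includes the descriptor used for the listing.
    return len(os.listdir(f"/proc/{pid}/fd"))


def parse_soft_nofile_limit(limits_text):
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the limit is 'unlimited' or the line is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields: Max, open, files, <soft>, <hard>, <units>
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def fd_fill_ratio(open_fds, soft_limit):
    """Fraction of the soft limit that is currently in use."""
    return open_fds / soft_limit


def report(pid="self"):
    """Build a one-line summary of descriptor usage for the given pid."""
    with open(f"/proc/{pid}/limits") as f:
        limit = parse_soft_nofile_limit(f.read())
    used = count_open_fds(pid)
    if limit is None:
        return f"pid {pid}: {used} open fds (no finite soft limit found)"
    return f"pid {pid}: {used}/{limit} open fds ({fd_fill_ratio(used, limit):.0%})"
```

Calling `report(<cassandra pid>)` every few seconds during the repair should show whether the count keeps climbing toward the limit.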
Here are my GC settings (Cassandra 4.0.2):

G1GC:
  -XX:+UseG1GC
  -XX:G1RSetUpdatingPauseTimePercent=5
  -XX:MaxGCPauseMillis=500
  -XX:InitiatingHeapOccupancyPercent=70
  -XX:ParallelGCThreads=16
  -XX:ConcGCThreads=16

ZGC:
  -XX:+UnlockExperimentalVMOptions
  -XX:+UseZGC
  -XX:ConcGCThreads=8
  -XX:ParallelGCThreads=8
  -XX:+UseTransparentHugePages

ZGC works more effectively on repair tasks (it runs much faster and doesn't overuse system resources), but I get random crashes on various nodes, so I can't treat it as production-ready.

On Mon, 13 Jun 2022 at 15:26, onmstester onmstester <onmstes...@zoho.com> wrote:
>
> Hi
>
> I've been testing Cassandra 4.0.3 and when I run full repair (on a single
> table), all of the bandwidth of my 1G link is saturated (also CPU became > 80%
> and disk util is 100%). stream_throughput has been set to 200 Mb but it is not
> affecting repair; all other configs are default and I could not find any
> other configuration related to limiting the throughput of repair.
>
> IMHO, having a node with saturated resources would make the whole cluster's
> response time slow.
> Any workaround for this? Is this some sort of bug?
>
> Best Regards
>

--
From Siberia with Love!