I found the cause of my server freeze: the COMMIT-LOG-WRITER thread is dead, so the blocking queue in PeriodicCommitLogExecutorService has filled up, and all mutationStage jobs are now stuck trying to flush their mutations.
The COMMIT-LOG-WRITER thread died because the disk filled up at one point. I freed up disk space (by deleting other files, not Cassandra files), but since the thread was already gone, the system stayed stuck and I had to restart the server.

Is it better to let the writer thread handle file system exceptions, or to let it die? Granted, letting the disk fill up is not good practice, but having the system proceed once disk space is freed seems the more natural expectation.

Thanks,
Yang
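
P.S. To make the question concrete, here is a rough Java sketch of the "handle it" option: a writer thread that catches the IOException from each task and keeps draining its bounded queue instead of dying. This is not the actual Cassandra code. ResilientWriterSketch, WriteTask, and the two counters are names I made up for illustration, and a real commit log would also have to decide whether to retry or fail the affected mutation rather than just counting it.

```java
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, NOT Cassandra's PeriodicCommitLogExecutorService.
class ResilientWriterSketch {
    /** A unit of write work that may fail with an I/O error (e.g. disk full). */
    interface WriteTask {
        void run() throws IOException;
    }

    // Small bounded queue: producers block in put() when it is full.
    private final BlockingQueue<WriteTask> queue = new ArrayBlockingQueue<>(4);
    private final AtomicInteger completed = new AtomicInteger();
    private final AtomicInteger failed = new AtomicInteger();
    private volatile boolean running = true;
    private final Thread writer;

    ResilientWriterSketch() {
        writer = new Thread(this::drainLoop, "SKETCH-LOG-WRITER");
        writer.start();
    }

    private void drainLoop() {
        // Keep going until shutdown is requested and the queue is drained.
        while (running || !queue.isEmpty()) {
            WriteTask task;
            try {
                task = queue.poll(50, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                return; // interrupted: stop the writer deliberately
            }
            if (task == null) {
                continue; // nothing queued yet, check the loop condition again
            }
            try {
                task.run();
                completed.incrementAndGet();
            } catch (IOException e) {
                // The key point: record the failure and keep the thread alive,
                // so later tasks can succeed once disk space is available again.
                failed.incrementAndGet();
            }
        }
    }

    /** Blocks while the queue is full, like a producer stage would. */
    void submit(WriteTask task) {
        try {
            queue.put(task);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing", e);
        }
    }

    /** Lets the writer drain whatever is queued, then waits for it to exit. */
    void shutdown() {
        running = false;
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    int completed() { return completed.get(); }

    int failed() { return failed.get(); }
}
```

With this shape, a task that throws "No space left on device" is counted as failed, and a task submitted afterwards still runs. Note that the sketch only catches IOException; an unchecked exception would still kill the thread unless it is caught as well.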