The only messages in the OS system log were exactly the same as those in daemon.log. The hosts did not shut down; only the Cassandra service stopped. So dmesg has nothing.
The amount of data being written is not that great, and GC times are always <1s. The only visible error-type messages relate to source systems being unavailable from time to time, but as there was nothing of this around the time the service stopped, I discounted these as a possible cause. I'll just keep an eye out if it happens again.

From: Bowen Song via user <user@cassandra.apache.org>
Sent: Thursday, August 4, 2022 12:20 PM
To: user@cassandra.apache.org
Subject: Re: Service shutdown

Generally speaking, I've seen the Cassandra process stop for the following reasons:

- OOM killer
- JVM OOM
- Received a signal, such as SIGTERM or SIGKILL
- File IO error when disk_failure_policy or commit_failure_policy is set to die
- Hardware issues, such as memory corruption, causing Cassandra to crash
- Reaching ulimit resource limits, such as "too many open files"

They all leave traces behind. You said you've checked the OS logs, but you posted only the systemd entries from DAEMON.LOG. Have you checked the "dmesg" output? Some system logs, such as OOM killer and MCE error logs, don't go into the DAEMON.LOG file.

On 04/08/2022 11:00, Marc Hoppins wrote:

Hulloa all,

Service on two nodes stopped yesterday and I can find nothing to indicate why. I have checked the Cassandra system.logs, gc.logs and debug.logs as well as the OS logs, and all I can see is the following, which is far from helpful:

DAEMON.LOG

Aug 3 11:39:12 cassandra19 systemd[1]: cassandra.service: Main process exited, code=exited, status=1/FAILURE
Aug 3 11:39:12 cassandra19 systemd[1]: cassandra.service: Failed with result 'exit-code'.

Aug 3 13:44:52 cassandra23 systemd[1]: cassandra.service: Main process exited, code=exited, status=1/FAILURE
Aug 3 13:44:52 cassandra23 systemd[1]: cassandra.service: Failed with result 'exit-code'.

Initially I thought the second node went down because it had problems communicating with the other stopped node, but with a gap of two hours that seems unlikely. If this occurs on either of these two nodes again, I will probably increase the logging level, but doing so for every node in the hope that I pick something up is impractical.

In the meantime, is there anything else I can look at which may deliver unto us more info?

Marc
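
(For reference, a minimal sketch of the checks discussed above, assuming a systemd-managed install where the unit is named cassandra and the main class is CassandraDaemon; adjust unit names, patterns and timestamps to your environment. The timestamps below are taken from the cassandra19 daemon.log excerpt. The nodetool call raises the log level on a single node without a restart.)

    # kernel-level traces (OOM killer, MCE/hardware errors) that never reach daemon.log
    dmesg -T | grep -iE 'oom|killed process|mce|hardware error'

    # everything systemd captured for the unit around the failure window
    journalctl -u cassandra --since "2022-08-03 11:30" --until "2022-08-03 11:45"

    # resource limits actually applied to the running process ("too many open files", etc.)
    cat /proc/$(pgrep -f CassandraDaemon | head -n1)/limits

    # raise the Cassandra log level on one node, without a restart
    nodetool setlogginglevel org.apache.cassandra DEBUG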