The only messages in the OS system log were exactly the same as those in daemon.log. The hosts did not shut down; only the Cassandra service stopped. So dmesg has nothing.
The amount of data being written is not that great, and GC times are always <1s. The only visible error-type messages relate to source systems being unavailable from time to time, but as there was nothing of this around the time the service stopped, I discounted these as a possible cause. I'll just keep an eye out if it happens again.

From: Bowen Song via user <user@cassandra.apache.org>
Sent: Thursday, August 4, 2022 12:20 PM
To: user@cassandra.apache.org
Subject: Re: Service shutdown

Generally speaking, I've seen the Cassandra process stop for the following reasons:

- OOM killer
- JVM OOM
- Received a signal, such as SIGTERM or SIGKILL
- File IO error when disk_failure_policy or commit_failure_policy is set to die
- Hardware issues, such as memory corruption, causing Cassandra to crash
- Reaching ulimit resource limits, such as "too many open files"

They all leave traces behind. You said you've checked the OS logs, but you posted only the systemd entries from DAEMON.LOG. Have you checked the "dmesg" output? Some system logs, such as OOM killer and MCE error logs, don't go into the DAEMON.LOG file.

On 04/08/2022 11:00, Marc Hoppins wrote:

Hulloa all,

Service on two nodes stopped yesterday and I can find nothing to indicate why. I have checked the Cassandra system.logs, gc.logs and debug.logs as well as the OS logs, and all I can see is the following, which is far from helpful:

DAEMON.LOG

Aug 3 11:39:12 cassandra19 systemd[1]: cassandra.service: Main process exited, code=exited, status=1/FAILURE
Aug 3 11:39:12 cassandra19 systemd[1]: cassandra.service: Failed with result 'exit-code'.

Aug 3 13:44:52 cassandra23 systemd[1]: cassandra.service: Main process exited, code=exited, status=1/FAILURE
Aug 3 13:44:52 cassandra23 systemd[1]: cassandra.service: Failed with result 'exit-code'.

Initially I thought the second node went down because it had problems communicating with the other stopped node, but with a gap of two hours that seems unlikely. If this occurs on either of these two nodes again, I will probably increase the logging level, but doing so for every node in the hope that I pick something up is impractical.

In the meantime, is there anything else I can look at which may deliver unto us more info?

Marc
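
(For reference, a minimal sketch of the checks discussed above, assuming a systemd-managed install where the unit is named cassandra and the main class is CassandraDaemon; adjust unit names, patterns and timestamps to your environment. The timestamps below are taken from the cassandra19 daemon.log excerpt. The nodetool call raises the log level on a single node without a restart.)

    # kernel-level traces (OOM killer, MCE/hardware errors) that never reach daemon.log
    dmesg -T | grep -iE 'oom|killed process|mce|hardware error'

    # everything systemd captured for the unit around the failure window
    journalctl -u cassandra --since "2022-08-03 11:30" --until "2022-08-03 11:45"

    # resource limits actually applied to the running process ("too many open files", etc.)
    cat /proc/$(pgrep -f CassandraDaemon | head -n1)/limits

    # raise the Cassandra log level on one node, without a restart
    nodetool setlogginglevel org.apache.cassandra DEBUG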