Yes, they are older Dell PowerEdges. I have to check whether there's newer firmware, but we've been running Ceph for years without these problems.

I checked the logs on the host on which I had a lockup just an hour ago, but there's nothing besides the expected hardreset messages. There are two older watchdog messages, but they are from March:

--------------------------------------------------------------------------------
SeqNumber       = 2089
Message ID      = ASR0000
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2025-03-27 07:16:03
Message         = The watchdog timer expired.
RawEventData    = 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF

FQDD            = WatchdogTimer.iDRAC.1
--------------------------------------------------------------------------------
SeqNumber       = 2088
Message ID      = ASR0000
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2025-03-27 07:06:41
Message         = The watchdog timer expired.
RawEventData    = 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF

FQDD            = WatchdogTimer.iDRAC.1
--------------------------------------------------------------------------------

I grepped the logs of another host where it happened, but couldn't find any watchdog messages there. I also think it's unlikely that all MDS hosts (we have five active, five hot standbys, and one cold standby) would suddenly start having hardware issues. I ran a memtest on one of the hosts last week as well and couldn't find anything there either.



On 16/04/2025 15:14, Anthony D'Atri wrote:
Curious, are your systems Dells?  If so you might see some improvement from 
running DSU to update all the firmware.  It might also be illuminating to run 
`racadm lclog view`

On Apr 16, 2025, at 8:32 AM, Janek Bevendorff <janek.bevendo...@uni-weimar.de> 
wrote:

Hi,

Since the latest Reef update I have the problem that some of my hosts suddenly 
go into a state where all CPUs are stuck in kernel mode causing all daemons on 
that host to become unresponsive. When I connect to the IPMI console, I see a 
lot of messages like:

watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]

(it's basically a list of all processes running on the machine).

Usually, this resolves itself after several minutes, but sometimes I have to 
hardreset the host. When this happens, all daemons are marked as down and I 
cannot interact with the host at all. I don't know what causes this, but I 
think it happens primarily on the hosts where my MDS run and it seems to be 
triggered by events such as cluster rebalances, MDS restarts, or just randomly.

I found a few reports about similar issues on the bug tracker and mailing list, 
but they are all very unspecific, unanswered, or more than 6 years old.

Is there any way I can debug this? I upgraded to Squid already, but that didn't 
solve the problem. I also had massive issues with this during the upgrade. 
Particularly at the end when the MDS were upgraded, I had constant struggles 
with it. I had to set the noout flag and then literally sit next to it to 
resume the upgrade every few minutes until it finally went through, because 
random MDS hosts went intermittently dark all the time.

All hosts run Ubuntu 22.04 with kernel 6.8.0.

Any ideas? Thanks!
Janek

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
