For whatever reason, in recent years I’ve seen these lockups more often on Dells 
than on other systems.  My first thought was that maybe you were running an ancient 
kernel, but then I saw that you aren’t.  Is the kernel you’re running the stock 
one that comes with your distribution?  I’ve seen CPU reset events on R750s 
running an elrepo kernel.

I suspect that some code change may have tickled a latent issue that perhaps 
you were fortunate to have not previously run into, but this is entirely 
speculation.
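
The RawEventData in the two watchdog entries quoted below can be decoded by hand, which may help narrow down what armed the timer. Here is a minimal Python sketch, assuming the 16 bytes follow the standard IPMI system event record layout as I understand it (record ID, record type, 4-byte little-endian timestamp, generator ID, EvM revision, sensor type, sensor number, event type, three event data bytes). The `decode_sel` function and the `TIMER_USE` table are my own illustrative names, not part of racadm or any Dell tool, and printing the timestamp as UTC is an assumption about how the iDRAC clock is set.

```python
from datetime import datetime, timezone

# Timer-use codes for the IPMI "Watchdog 2" sensor (event data 2, low nibble).
# Assumption: values as I recall them from the IPMI sensor type table.
TIMER_USE = {1: "BIOS FRB2", 2: "BIOS/POST", 3: "OS Load", 4: "SMS/OS", 5: "OEM"}


def decode_sel(raw: str) -> dict:
    """Decode a 16-byte SEL record given as '0x03,0x00,...' (hypothetical helper)."""
    b = bytes(int(tok, 16) for tok in raw.split(",") if tok.strip())
    if len(b) != 16:
        raise ValueError(f"expected 16 bytes, got {len(b)}")
    seconds = int.from_bytes(b[3:7], "little")
    return {
        "record_id": int.from_bytes(b[0:2], "little"),
        "record_type": b[2],
        # Assumption: the BMC clock is UTC; it is really "whatever the BMC thinks".
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "sensor_type": b[10],          # 0x23 is the Watchdog 2 sensor type
        "sensor_number": b[11],
        "event_offset": b[13] & 0x0F,  # 0 = timer expired, status only
        "timer_use": TIMER_USE.get(b[14] & 0x0F, "unknown"),
    }


if __name__ == "__main__":
    entry = "0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF"
    for key, value in decode_sel(entry).items():
        print(f"{key:14} {value}")
```

For SeqNumber 2089 this yields a timestamp of 2025-03-27 07:16:03, matching the Timestamp field in the log, and a timer use of SMS/OS. If I am reading the layout correctly, that would suggest the watchdog was armed by software on the running host rather than by the BIOS, but treat that as a hint and not a diagnosis.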

> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff 
> <janek.bevendo...@uni-weimar.de> wrote:
> 
> Yes, they are older Dell PowerEdges. I have to check whether there's newer 
> firmware, but we've been running Ceph for years without these problems.
> 
> I checked the logs on the host on which I had a lockup just an hour ago, but 
> there's nothing besides the expected hardreset messages. There are two older 
> watchdog messages, but they are from March:
> 
> --------------------------------------------------------------------------------
> SeqNumber       = 2089
> Message ID      = ASR0000
> Category        = System
> AgentID         = SEL
> Severity        = Critical
> Timestamp       = 2025-03-27 07:16:03
> Message         = The watchdog timer expired.
> RawEventData    = 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
> 
> FQDD            = WatchdogTimer.iDRAC.1
> --------------------------------------------------------------------------------
> SeqNumber       = 2088
> Message ID      = ASR0000
> Category        = System
> AgentID         = SEL
> Severity        = Critical
> Timestamp       = 2025-03-27 07:06:41
> Message         = The watchdog timer expired.
> RawEventData    = 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
> 
> FQDD            = WatchdogTimer.iDRAC.1
> --------------------------------------------------------------------------------
> 
> I grepped the logs of another host where it happened, but couldn't find any 
> watchdog messages there. I believe it's also unlikely that suddenly all MDS 
> hosts (we have five active, five hot standbys, and one cold standby) start 
> having hardware issues. I also ran a memtest on one of the hosts last week 
> and couldn't find anything there either.
> 
> 
> 
> On 16/04/2025 15:14, Anthony D'Atri wrote:
>> Curious, are your systems Dells?  If so you might see some improvement from 
>> running DSU to update all the firmware.  It might also be illuminating to 
>> run `racadm lclog view`
>> 
>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff 
>>> <janek.bevendo...@uni-weimar.de> wrote:
>>> 
>>> Hi,
>>> 
>>> Since the latest Reef update I have the problem that some of my hosts 
>>> suddenly go into a state where all CPUs are stuck in kernel mode causing 
>>> all daemons on that host to become unresponsive. When I connect to the IPMI 
>>> console, I see a lot of messages like:
>>> 
>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>> 
>>> (it's basically a list of all processes running on the machine).
>>> 
>>> Usually, this resolves itself after several minutes, but sometimes I have 
>>> to hard-reset the host. When this happens, all daemons are marked as down 
>>> and I cannot interact with the host at all. I don't know what causes this, 
>>> but I think it happens primarily on the hosts where my MDS run, and it 
>>> seems to be triggered by events such as cluster rebalances, MDS restarts, 
>>> or just randomly.
>>> 
>>> I found a few reports about similar issues on the bug tracker and mailing 
>>> list, but they are all very unspecific, unanswered, or more than 6 years 
>>> old.
>>> 
>>> Is there any way I can debug this? I upgraded to Squid already, but that 
>>> didn't solve the problem. I also had massive issues with this during the 
>>> upgrade. Particularly at the end when the MDS were upgraded, I had constant 
>>> struggles with it. I had to set the noout flag and then literally sit next 
>>> to it to resume the upgrade every few minutes until it finally went 
>>> through, because random MDS hosts went intermittently dark all the time.
>>> 
>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>> 
>>> Any ideas? Thanks!
>>> Janek
>>> 
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io