[ceph-users] Re: "BUG: soft lockup" with MDS

Anthony D'Atri Wed, 16 Apr 2025 07:19:57 -0700

For whatever reason, in recent years I’ve seen these more often with Dells than 
other systems.  My first thought was that maybe you were running an ancient 
kernel, but then I saw that you aren’t.  Is the kernel you’re running the stock 
one that comes with your distribution?  I’ve seen CPU reset events on R750s 
running an elrepo kernel.


I suspect that some code change may have tickled a latent issue that you were 
fortunate not to have run into before, but this is entirely speculation.
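
As an aside, the RawEventData in the SEL entries quoted below looks like a 
standard 16-byte IPMI SEL system event record, so it can be decoded offline. 
Here is a minimal Python sketch, assuming that layout (record ID, record type, 
and timestamp in little-endian order, followed by generator ID, EvMRev, sensor 
type, sensor number, event type, and three event data bytes). The function 
name `decode_sel_record` is made up for illustration and is not part of any 
Dell or Ceph tooling.

```python
import struct
from datetime import datetime, timezone


def decode_sel_record(raw: str) -> dict:
    """Decode a comma-separated hex dump of one 16-byte SEL record.

    Assumes the standard IPMI "system event" record layout; OEM record
    types use a different layout and are not handled here.
    """
    data = bytes(int(token, 16) for token in raw.split(","))
    if len(data) != 16:
        raise ValueError(f"expected 16 bytes, got {len(data)}")

    # Bytes 0-1: record ID, byte 2: record type, bytes 3-6: Unix timestamp,
    # all little-endian with no padding.
    record_id, record_type, timestamp = struct.unpack_from("<HBI", data, 0)

    return {
        "record_id": record_id,
        "record_type": record_type,        # 0x02 = system event record
        "timestamp": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "sensor_type": data[10],           # 0x23 = Watchdog 2 in the IPMI spec
        "sensor_number": data[11],
        "event_type": data[12],            # 0x6F = sensor-specific event
        "event_data": data[13:16].hex(),
    }


if __name__ == "__main__":
    raw = ("0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,"
           "0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF")
    for key, value in decode_sel_record(raw).items():
        print(f"{key:14} = {value}")
```

For the first quoted entry, the timestamp bytes decode to 2025-03-27 07:16:03 
UTC, which matches the Timestamp field, and the sensor type byte is 0x23 
(Watchdog 2). That only confirms what the Message field already says, so it is 
mostly useful when you have just the raw bytes to go on.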

> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff 
> <janek.bevendo...@uni-weimar.de> wrote:
> 
> Yes, they are older Dell PowerEdges. I have to check whether there's newer 
> firmware, but we've been running Ceph for years without these problems.
> 
> I checked the logs on the host on which I had a lockup just an hour ago, but 
> there's nothing besides the expected hard-reset messages. There are two older 
> watchdog messages, but they are from March:
> 
> --------------------------------------------------------------------------------
> SeqNumber       = 2089
> Message ID      = ASR0000
> Category        = System
> AgentID         = SEL
> Severity        = Critical
> Timestamp       = 2025-03-27 07:16:03
> Message         = The watchdog timer expired.
> RawEventData    = 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
> 
> FQDD            = WatchdogTimer.iDRAC.1
> --------------------------------------------------------------------------------
> SeqNumber       = 2088
> Message ID      = ASR0000
> Category        = System
> AgentID         = SEL
> Severity        = Critical
> Timestamp       = 2025-03-27 07:06:41
> Message         = The watchdog timer expired.
> RawEventData    = 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
> 
> FQDD            = WatchdogTimer.iDRAC.1
> --------------------------------------------------------------------------------
> 
> I grepped the logs of another host where it happened, but couldn't find any 
> watchdog messages there. I also believe it's unlikely that all MDS hosts (we 
> have five active, five hot standbys, and one cold standby) would suddenly 
> start having hardware issues. I also ran a memtest on one of the hosts last 
> week and couldn't find anything there either.
> 
> 
> 
> On 16/04/2025 15:14, Anthony D'Atri wrote:
>> Curious, are your systems Dells?  If so you might see some improvement from 
>> running DSU to update all the firmware.  It might also be illuminating to 
>> run `racadm lclog view`
>> 
>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff 
>>> <janek.bevendo...@uni-weimar.de> wrote:
>>> 
>>> Hi,
>>> 
>>> Since the latest Reef update I have the problem that some of my hosts 
>>> suddenly go into a state where all CPUs are stuck in kernel mode causing 
>>> all daemons on that host to become unresponsive. When I connect to the IPMI 
>>> console, I see a lot of messages like:
>>> 
>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>> 
>>> (it's basically a list of all processes running on the machine).
>>> 
>>> Usually, this resolves itself after several minutes, but sometimes I have 
>>> to hard-reset the host. When this happens, all daemons are marked as down 
>>> and I cannot interact with the host at all. I don't know what causes this, 
>>> but I think it happens primarily on the hosts where my MDS run. It seems 
>>> to be triggered by events such as cluster rebalances or MDS restarts, or 
>>> it just happens randomly.
>>> 
>>> I found a few reports about similar issues on the bug tracker and mailing 
>>> list, but they are all very unspecific, unanswered, or more than 6 years 
>>> old.
>>> 
>>> Is there any way I can debug this? I upgraded to Squid already, but that 
>>> didn't solve the problem. I also had massive issues with this during the 
>>> upgrade. Particularly at the end when the MDS were upgraded, I had constant 
>>> struggles with it. I had to set the noout flag and then literally sit next 
>>> to it to resume the upgrade every few minutes until it finally went 
>>> through, because random MDS hosts went intermittently dark all the time.
>>> 
>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>> 
>>> Any ideas? Thanks!
>>> Janek
>>> 
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io