[ceph-users] Re: "BUG: soft lockup" with MDS

Anthony D'Atri Wed, 16 Apr 2025 08:17:28 -0700

Ack, I know the R730xd very well, mostly from running Trusty and Luminous at 
the time.  BIOS updates inherently require a reboot.  Check for CPLD/SPLD 
updates as well; that firmware changes very rarely, but ISTR that this model 
had at least one update after FCS.



> 
> The servers our Ceph runs on are all R730xd machines.
> 
> I checked the Dell repository manager and it looks like there is at least one 
> BIOS update that's newer than what we've already installed, so I've updated 
> our Firmware repository and will schedule the updates now. That's going to 
> take a long while.
> 
> 
> On 16/04/2025 16:16, Anthony D'Atri wrote:
>> For whatever reason, in recent years I’ve seen these more often with Dells 
>> than other systems.  My first thought was that maybe you were running an 
>> ancient kernel, but then I saw that you aren’t.  Is the kernel you’re 
>> running the stock one that comes with your distribution?  I’ve seen CPU 
>> reset events on R750s running an elrepo kernel.
>> 
>> I suspect that some code change may have tickled a latent issue that perhaps 
>> you were fortunate to have not previously run into, but this is entirely 
>> speculation.
>> 
>>> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff 
>>> <janek.bevendo...@uni-weimar.de> wrote:
>>> 
>>> Yes, they are older Dell PowerEdges. I have to check whether there's newer 
>>> firmware, but we've been running Ceph for years without these problems.
>>> 
>>> I checked the logs on the host on which I had a lockup just an hour ago, 
>>> but there's nothing besides the expected hard-reset messages. There are two 
>>> older watchdog messages, but they are from March:
>>> 
>>> --------------------------------------------------------------------------------
>>> SeqNumber       = 2089
>>> Message ID      = ASR0000
>>> Category        = System
>>> AgentID         = SEL
>>> Severity        = Critical
>>> Timestamp       = 2025-03-27 07:16:03
>>> Message         = The watchdog timer expired.
>>> RawEventData    = 
>>> 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>> 
>>> FQDD            = WatchdogTimer.iDRAC.1
>>> --------------------------------------------------------------------------------
>>> SeqNumber       = 2088
>>> Message ID      = ASR0000
>>> Category        = System
>>> AgentID         = SEL
>>> Severity        = Critical
>>> Timestamp       = 2025-03-27 07:06:41
>>> Message         = The watchdog timer expired.
>>> RawEventData    = 
>>> 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>> 
>>> FQDD            = WatchdogTimer.iDRAC.1
>>> --------------------------------------------------------------------------------
>>> 
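As a cross-check on the two records above, here is a minimal Python sketch that 
decodes the RawEventData field, assuming the standard 16-byte IPMI SEL record 
layout (record ID, record type, 32-bit little-endian timestamp, generator ID, 
EvM revision, sensor type, sensor number, event type, three event data bytes). 
The name decode_sel and both lookup tables are my own; the tables come from my 
recollection of the IPMI 2.0 Watchdog 2 sensor definition and should be 
verified against the spec. Rendering the timestamp as UTC is also an 
assumption, although it does reproduce the Timestamp field of both records.

```python
from datetime import datetime, timezone

# Assumed values for sensor type 0x23 (Watchdog 2), recalled from the IPMI 2.0
# sensor-type table; check them against the specification before relying on them.
WATCHDOG_ACTIONS = {
    0x0: "timer expired, status only",
    0x1: "hard reset",
    0x2: "power down",
    0x3: "power cycle",
    0x8: "timer interrupt",
}
TIMER_USE = {
    0x1: "BIOS FRB2",
    0x2: "BIOS/POST",
    0x3: "OS load",
    0x4: "SMS/OS",
    0x5: "OEM",
}


def decode_sel(raw: str) -> dict:
    """Decode one 16-byte SEL record given as '0x03,0x00,...' text."""
    data = bytes(int(tok, 16) for tok in raw.split(","))
    if len(data) != 16:
        raise ValueError(f"expected 16 bytes, got {len(data)}")
    # Bytes 3..6 hold the timestamp as little-endian seconds since the epoch.
    seconds = int.from_bytes(data[3:7], "little")
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return {
        "record_id": int.from_bytes(data[0:2], "little"),
        "record_type": data[2],
        "timestamp": stamp.strftime("%Y-%m-%d %H:%M:%S"),
        "sensor_type": data[10],
        "sensor_number": data[11],
        # Low nibble of event data 1 is the sensor-specific offset.
        "action": WATCHDOG_ACTIONS.get(data[13] & 0x0F, "unknown"),
        # Low nibble of event data 2 is the timer use at expiration.
        "timer_use": TIMER_USE.get(data[14] & 0x0F, "unknown"),
    }
```

For the 07:16:03 record this gives sensor type 0x23 and, if the tables are 
right, a timer use of SMS/OS, which would point at the OS-level watchdog rather 
than a BIOS or POST timer.
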
>>> I grepped the logs of another host where it happened, but couldn't find any 
>>> watchdog messages there. I believe it's also unlikely that suddenly all 
>>> MDS hosts (we have five active, five hot standbys, and one cold standby) 
>>> start having hardware issues. I also ran a memtest on one of the hosts last 
>>> week and couldn't find anything there either.
>>> 
>>> 
>>> 
>>> On 16/04/2025 15:14, Anthony D'Atri wrote:
>>>> Curious, are your systems Dells?  If so you might see some improvement 
>>>> from running DSU to update all the firmware.  It might also be 
>>>> illuminating to run `racadm lclog view`
>>>> 
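To sift a long `racadm lclog view` dump for specific events, something like the 
following Python sketch may help. It assumes the output has been saved to text 
and looks like the records quoted in this thread: "Key = Value" lines, records 
separated by rows of dashes, and an occasional value wrapped onto the next 
line. The function names parse_lclog and by_message_id are hypothetical, and 
the format assumption is based only on those quoted records.

```python
def parse_lclog(text: str) -> list[dict]:
    """Split saved lclog text into one dict per record."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        line = line.strip()
        if line and set(line) == {"-"}:
            # A row of dashes closes the current record.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif "=" in line:
            key, _, value = line.partition("=")
            last_key = key.strip()
            current[last_key] = value.strip()
        elif line and last_key is not None:
            # No '=' on the line: treat it as a wrapped continuation value.
            current[last_key] += line
    if current:
        records.append(current)
    return records


def by_message_id(records: list[dict], message_id: str) -> list[dict]:
    """Keep only records whose 'Message ID' field matches exactly."""
    return [r for r in records if r.get("Message ID") == message_id]
```

Calling by_message_id(parse_lclog(saved_text), "ASR0000") would then return 
only the watchdog-expiry records, where saved_text is whatever text was 
captured from the command.
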
>>>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff 
>>>>> <janek.bevendo...@uni-weimar.de> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Since the latest Reef update I have the problem that some of my hosts 
>>>>> suddenly go into a state where all CPUs are stuck in kernel mode causing 
>>>>> all daemons on that host to become unresponsive. When I connect to the 
>>>>> IPMI console, I see a lot of messages like:
>>>>> 
>>>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>>> 
>>>>> (it's basically a list of all processes running on the machine).
>>>>> 
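If the console or kernel log output can be captured, tallying the lockup lines 
may show whether particular CPUs or tasks dominate. Below is a minimal Python 
sketch, assuming every line of interest has the same shape as the one quoted 
above; the regular expression and the name summarize_lockups are mine, not part 
of any Ceph or kernel tooling.

```python
import re
from collections import Counter

# Pattern modelled on the single line quoted in this thread. The task name is
# matched greedily because names such as kworker/8:1 contain a colon themselves.
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def summarize_lockups(lines):
    """Count soft-lockup reports per CPU and per task name."""
    per_cpu, per_task, longest = Counter(), Counter(), 0
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match is None:
            continue  # not a soft-lockup report
        per_cpu[int(match["cpu"])] += 1
        per_task[match["comm"]] += 1
        longest = max(longest, int(match["secs"]))
    return {"per_cpu": per_cpu, "per_task": per_task, "longest_secs": longest}
```

Feeding it the lines of a saved console capture returns the per-CPU and 
per-task counts plus the longest reported stall in seconds.
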
>>>>> Usually, this resolves itself after several minutes, but sometimes I have 
>>>>> to hard-reset the host. When this happens, all daemons are marked as down 
>>>>> and I cannot interact with the host at all. I don't know what causes 
>>>>> this, but I think it happens primarily on the hosts where my MDS run, and 
>>>>> it seems to be triggered by events such as cluster rebalances or MDS 
>>>>> restarts, or to occur just randomly.
>>>>> 
>>>>> I found a few reports about similar issues on the bug tracker and mailing 
>>>>> list, but they are all very unspecific, unanswered, or more than 6 years 
>>>>> old.
>>>>> 
>>>>> Is there any way I can debug this? I upgraded to Squid already, but that 
>>>>> didn't solve the problem. I also had massive issues with this during the 
>>>>> upgrade. Particularly at the end when the MDS were upgraded, I had 
>>>>> constant struggles with it. I had to set the noout flag and then 
>>>>> literally sit next to it to resume the upgrade every few minutes until it 
>>>>> finally went through, because random MDS hosts went intermittently dark 
>>>>> all the time.
>>>>> 
>>>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>>> 
>>>>> Any ideas? Thanks!
>>>>> Janek
>>>>> 
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> -- 
> Bauhaus-Universität Weimar
> Bauhausstr. 9a, R308
> 99423 Weimar, Germany
> 
> Phone: +49 3643 58 3577
> www.webis.de
> 