Just FYI: We upgraded our image to Ubuntu 24.04 with kernel 6.11. After a full reboot, the cluster seems to be stable again without lockups.

I still see a number of warnings about "stalled read in db device of BlueFS" and "slow operations in BlueStore", but that may be a separate issue, perhaps related to dying hard disks.
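For those BlueFS/BlueStore warnings, checking SMART data on the backing disks of the affected OSDs is a reasonable first step. A minimal sketch, assuming a cephadm-managed cluster; `osd.12` and `/dev/sdX` are placeholders for your environment:

```shell
# List current health warnings, filtered to the two symptoms
ceph health detail | grep -i -e 'stalled read' -e 'slow operations'

# Map a suspect OSD daemon to its physical device
ceph device ls-by-daemon osd.12

# Check SMART counters that typically precede disk death
smartctl -a /dev/sdX | grep -i -e 'reallocated' -e 'pending' -e 'uncorrect'
```

Non-zero reallocated or pending sector counts would point at the disks rather than at BlueStore itself.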

Janek


On 12/05/2025 14:39, Frédéric Nass wrote:
The latest firmware for the HGST HUH721010AL5200 is 'LS21' [1], released in 2021. Your drives have likely been running this firmware for years, so it is probably not the source of the issue.

Frédéric.

[1] 
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverId=MGW91&lwp=rt

----- On 12 May 25, at 13:48, Janek Bevendorff janek.bevendo...@uni-weimar.de
wrote:

I tried to install both firmwares on one of the nodes, but they're not
compatible. Most of our disks are HGST HUH721010AL5200 10TB SAS disks.


On 12/05/2025 10:51, Frédéric Nass wrote:
Hi Janek,

I just checked and we upgraded both HDD and SSD firmwares to those versions
released last month.

HDD firmware (DELL/Seagate 'ST16000NM006J'):
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=xf65r&lwp=rt
NVMe firmware (DELL/Kioxia 'Dell Ent NVMe CM7 U.2 RI 1.92TB'):
https://www.dell.com/support/home/en-us/drivers/DriversDetails?driverID=VH1YP&lwp=rt

What models are your drives?

Regards,
Frédéric.

----- On 12 May 25, at 9:37, Janek Bevendorff janek.bevendo...@uni-weimar.de
wrote:

Hi all,

Kernel is 6.8.0 (Ubuntu). The thermal settings in iDRAC are already
quite high and we have good overall cooling, so that shouldn't
cause any issues. Our cold aisle is at around 20 °C.

@Frédéric Do you have a link to the HDD firmware? I installed everything
that's available in the Dell catalogue. Also, I don't know whether CPU
usage is high during these lockups, since I cannot observe the host
state when it happens. It's as if the entire node goes down until it
either recovers on its own or I do a hard reset.
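Since the host is unreachable over SSH during these episodes, one option is to leave a console session attached from another machine beforehand. A sketch using IPMI serial-over-LAN; `<idrac-host>` and `<user>` are placeholders:

```shell
# Attach to the host's serial console through iDRAC SOL
ipmitool -I lanplus -H <idrac-host> -U <user> sol activate

# On the Ceph host: raise the console log level so kernel
# warnings (including soft-lockup traces) reach the console
sysctl -w kernel.printk='7 4 1 7'
```

Note this only shows kernel traces if the kernel is actually logging to the serial console (e.g. `console=ttyS0,115200` on the kernel command line).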

Janek


On 09/05/2025 15:08, Anthony D'Atri wrote:
There are separate thermal, overall performance, and fan states in iDRAC.  I’ve
found that I often have to bump up the default “fan offset” for more cooling.

On May 9, 2025, at 8:51 AM, Frédéric Nass <frederic.n...@univ-lorraine.fr>
wrote:

Hi Janek,

We just had a very similar issue with recent hardware (DELL R760xd) going nuts
(100% CPU load) for 10 to 20 minutes and OSDs being reported 'down' for not
responding in time.

Switching the CPU profile to HPC (High Performance Computing) and the thermal
settings to Maximum Performance (or is it Optimized?) in the BIOS, and upgrading
the HDD firmware to the latest version, which was only available from DELL's
website (not yet in the OpenManage catalog), fixed it.
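For reference, the profile switch can also be done from the OS via racadm instead of the BIOS setup screens. A sketch, assuming a recent PowerEdge generation; attribute names can vary between BIOS versions, so check `racadm get BIOS.SysProfileSettings` first:

```shell
# Inspect the current system profile settings
racadm get BIOS.SysProfileSettings

# Stage the profile change (PerfOptimized = Performance)
racadm set BIOS.SysProfileSettings.SysProfile PerfOptimized

# Create a BIOS configuration job and apply it with a power cycle
racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW
```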

Maybe you can give it a try.

Regards,
Frédéric.

----- On 9 May 25, at 9:02, Janek Bevendorff janek.bevendo...@uni-weimar.de
wrote:

Hi, it's happening again. I haven't fully upgraded the firmware on all
hosts yet, but I have on all MDS hosts. I managed to finish the Ceph
upgrade, but now I'm randomly getting the soft lockups again, mostly
(but not only) on the MDS hosts.

Anything else I could check for?

Janek


On 16/04/2025 17:38, Janek Bevendorff wrote:
Yes, we have a mirror of the Dell Firmware catalogue, so the servers
can check what they need. There are three updates in total: BIOS, NIC,
and Lifecycle Controller.

I hope the BIOS update fixes this.


On 16/04/2025 17:16, Anthony D'Atri wrote:
Ack, I know the R730xd very well, mostly running Trusty and Luminous
at the time.  BIOS updates inherently require a reboot.  Check for
CPLD/SPLD as well, that changes very rarely but ISTR that this model
had at least one update after FCS.


The servers our Ceph runs on are all R730xd machines.

I checked the Dell repository manager and it looks like there is at
least one BIOS update that's newer than what we've already
installed, so I've updated our Firmware repository and will schedule
the updates now. That's going to take a long while.


On 16/04/2025 16:16, Anthony D'Atri wrote:
For whatever reason, in recent years I’ve seen these more often
with Dells than other systems. My first thought was that maybe you
were running an ancient kernel, but then I saw that you aren’t.  Is
the kernel you’re running the stock one that comes with your
distribution?  I’ve seen CPU reset events on R750s running an
elrepo kernel.

I suspect that some code change may have tickled a latent issue
that perhaps you were fortunate to have not previously run into,
but this is entirely speculation.

On Apr 16, 2025, at 9:39 AM, Janek Bevendorff
<janek.bevendo...@uni-weimar.de> wrote:

Yes, they are older Dell PowerEdges. I have to check whether
there's newer firmware, but we've been running Ceph for years
without these problems.

I checked the logs on the host on which I had a lockup just an
hour ago, but there's nothing besides the expected hardreset
messages. There are two older watchdog messages, but they are from
March:

--------------------------------------------------------------------------------

SeqNumber       = 2089
Message ID      = ASR0000
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2025-03-27 07:16:03
Message         = The watchdog timer expired.
RawEventData    =
0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF

FQDD            = WatchdogTimer.iDRAC.1
--------------------------------------------------------------------------------

SeqNumber       = 2088
Message ID      = ASR0000
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2025-03-27 07:06:41
Message         = The watchdog timer expired.
RawEventData    =
0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF

FQDD            = WatchdogTimer.iDRAC.1
--------------------------------------------------------------------------------


I grepped the logs of another host where it happened, but couldn't
find any watchdog messages there. I also believe it's unlikely
that all MDS hosts (we have five active, five hot standbys, and one
cold standby) suddenly start having hardware issues at once. I
also ran a memtest on one of the hosts last week and couldn't find
anything there either.



On 16/04/2025 15:14, Anthony D'Atri wrote:
Curious, are your systems Dells? If so you might see some
improvement from running DSU to update all the firmware.  It
might also be illuminating to run `racadm lclog view`

On Apr 16, 2025, at 8:32 AM, Janek Bevendorff
<janek.bevendo...@uni-weimar.de> wrote:

Hi,

Since the latest Reef update, I have the problem that some of my
hosts suddenly go into a state where all CPUs are stuck in
kernel mode, causing all daemons on that host to become
unresponsive. When I connect to the IPMI console, I see a lot of
messages like:

watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]

(it's basically a list of all processes running on the machine).
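On a live system these lines usually also land in the kernel ring buffer (`journalctl -k` or `dmesg`), so they can be collected after the fact. As a small offline sketch, the interesting fields can be pulled out like this (the sample line is the one quoted above):

```shell
# Hypothetical sample of one console line, for parsing offline
line='watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]'

# Extract the CPU number, the stall duration, and the stuck task
cpu=$(printf '%s\n' "$line" | sed -n 's/.*CPU#\([0-9]*\).*/\1/p')
secs=$(printf '%s\n' "$line" | sed -n 's/.*stuck for \([0-9]*\)s.*/\1/p')
proc=$(printf '%s\n' "$line" | sed -n 's/.*\[\(.*\)\].*/\1/p')

echo "CPU $cpu stalled ${secs}s in $proc"
```

Tallying which CPUs and tasks recur across incidents can hint at whether the stalls correlate with MDS activity or are spread evenly.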

Usually, this resolves itself after several minutes, but
sometimes I have to hard-reset the host. When this happens, all
daemons are marked as down and I cannot interact with the host
at all. I don't know what causes it, but I think it happens
primarily on the hosts where my MDS run, and it seems to be
triggered by events such as cluster rebalances or MDS restarts,
or sometimes just randomly.

I found a few reports about similar issues on the bug tracker
and mailing list, but they are all very unspecific, unanswered,
or more than 6 years old.

Is there any way I can debug this? I already upgraded to Squid,
but that didn't solve the problem. The upgrade itself was also a
struggle, particularly at the end when the MDS were upgraded:
random MDS hosts kept going intermittently dark, so I had to set
the noout flag and literally sit next to the cluster, resuming
the upgrade every few minutes until it finally went through.
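For anyone hitting the same thing, the flag-setting and resume steps were along these lines (a cephadm-orchestrated upgrade is assumed here):

```shell
# Prevent OSDs on flapping hosts from being marked out
ceph osd set noout

# Resume a paused/stalled cephadm upgrade and watch its progress
ceph orch upgrade resume
ceph orch upgrade status

# Once the upgrade is through, clear the flag again
ceph osd unset noout
```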

All hosts run Ubuntu 22.04 with kernel 6.8.0.

Any ideas? Thanks!
Janek

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de







