Ack, I know the R730xd very well; I mostly ran them with Trusty and Luminous at the time. BIOS updates inherently require a reboot. Check for a CPLD/SPLD update as well: that firmware changes very rarely, but I seem to recall that this model had at least one update after FCS.
>
> The servers our Ceph runs on are all R730xd machines.
>
> I checked the Dell repository manager and it looks like there is at least one
> BIOS update that's newer than what we've already installed, so I've updated
> our firmware repository and will schedule the updates now. That's going to
> take a long while.
>
>
> On 16/04/2025 16:16, Anthony D'Atri wrote:
>> For whatever reason, in recent years I’ve seen these more often with Dells
>> than other systems. My first thought was that maybe you were running an
>> ancient kernel, but then I saw that you aren’t. Is the kernel you’re
>> running the stock one that comes with your distribution? I’ve seen CPU
>> reset events on R750s running an elrepo kernel.
>>
>> I suspect that some code change may have tickled a latent issue that perhaps
>> you were fortunate to have not previously run into, but this is entirely
>> speculation.
>>
>>> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff
>>> <janek.bevendo...@uni-weimar.de> wrote:
>>>
>>> Yes, they are older Dell PowerEdges. I have to check whether there's newer
>>> firmware, but we've been running Ceph for years without these problems.
>>>
>>> I checked the logs on the host on which I had a lockup just an hour ago,
>>> but there's nothing besides the expected hard-reset messages. There are two
>>> older watchdog messages, but they are from March:
>>>
>>> --------------------------------------------------------------------------------
>>> SeqNumber = 2089
>>> Message ID = ASR0000
>>> Category = System
>>> AgentID = SEL
>>> Severity = Critical
>>> Timestamp = 2025-03-27 07:16:03
>>> Message = The watchdog timer expired.
>>> RawEventData =
>>> 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>
>>> FQDD = WatchdogTimer.iDRAC.1
>>> --------------------------------------------------------------------------------
>>> SeqNumber = 2088
>>> Message ID = ASR0000
>>> Category = System
>>> AgentID = SEL
>>> Severity = Critical
>>> Timestamp = 2025-03-27 07:06:41
>>> Message = The watchdog timer expired.
>>> RawEventData =
>>> 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>
>>> FQDD = WatchdogTimer.iDRAC.1
>>> --------------------------------------------------------------------------------
>>>
>>> I grepped the logs of another host where it happened, but couldn't find any
>>> watchdog messages there. I believe it's also unlikely that suddenly all
>>> MDS hosts (we have five active, five hot standbys, and one cold standby)
>>> start having hardware issues. I also ran a memtest on one of the hosts last
>>> week and couldn't find anything there either.
>>>
>>>
>>> On 16/04/2025 15:14, Anthony D'Atri wrote:
>>>> Curious, are your systems Dells? If so you might see some improvement
>>>> from running DSU to update all the firmware. It might also be
>>>> illuminating to run `racadm lclog view`.
>>>>
>>>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff
>>>>> <janek.bevendo...@uni-weimar.de> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Since the latest Reef update I have the problem that some of my hosts
>>>>> suddenly go into a state where all CPUs are stuck in kernel mode, causing
>>>>> all daemons on that host to become unresponsive. When I connect to the
>>>>> IPMI console, I see a lot of messages like:
>>>>>
>>>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>>>
>>>>> (it's basically a list of all processes running on the machine).
>>>>>
>>>>> Usually, this resolves itself after several minutes, but sometimes I have
>>>>> to hard-reset the host. When this happens, all daemons are marked as down
>>>>> and I cannot interact with the host at all. I don't know what causes this,
>>>>> but I think it happens primarily on the hosts where my MDS run and it
>>>>> seems to be triggered by events such as cluster rebalances, MDS restarts,
>>>>> or just randomly.
>>>>>
>>>>> I found a few reports about similar issues on the bug tracker and mailing
>>>>> list, but they are all very unspecific, unanswered, or more than 6 years
>>>>> old.
>>>>>
>>>>> Is there any way I can debug this? I upgraded to Squid already, but that
>>>>> didn't solve the problem. I also had massive issues with this during the
>>>>> upgrade. Particularly at the end when the MDS were upgraded, I had
>>>>> constant struggles with it. I had to set the noout flag and then
>>>>> literally sit next to it to resume the upgrade every few minutes until it
>>>>> finally went through, because random MDS hosts went intermittently dark
>>>>> all the time.
>>>>>
>>>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>>>
>>>>> Any ideas? Thanks!
>>>>> Janek
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> Bauhaus-Universität Weimar
> Bauhausstr. 9a, R308
> 99423 Weimar, Germany
>
> Phone: +49 3643 58 3577
> www.webis.de
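
For anyone who wants to sift the Lifecycle Controller entries quoted above across several hosts, here is a minimal sketch of a parser. It assumes the output of `racadm lclog view` has been saved as text with the layout shown in the thread (records separated by dashed lines, one `Key = Value` pair per line, and a value that may wrap onto the following line, as RawEventData does). The function names `parse_lclog` and `watchdog_events` are hypothetical helpers, not part of any Dell tooling, and the layout on other iDRAC firmware versions may differ.

```python
import re
from typing import Dict, List

# A record separator is a line made up only of dashes (assumed layout).
SEPARATOR = re.compile(r"^-{10,}$")

# Two records copied from the thread, with the quote markers removed.
SAMPLE = """\
--------------------------------------------------------------------------------
SeqNumber = 2089
Message ID = ASR0000
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2025-03-27 07:16:03
Message = The watchdog timer expired.
RawEventData =
0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF

FQDD = WatchdogTimer.iDRAC.1
--------------------------------------------------------------------------------
SeqNumber = 2088
Message ID = ASR0000
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2025-03-27 07:06:41
Message = The watchdog timer expired.
RawEventData =
0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF

FQDD = WatchdogTimer.iDRAC.1
--------------------------------------------------------------------------------
"""


def parse_lclog(text: str) -> List[Dict[str, str]]:
    """Split saved lclog text into one dict per record."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if SEPARATOR.match(line):
            # A dashed line closes the record collected so far, if any.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif not line:
            continue
        elif "=" in line:
            key, _, value = line.partition("=")
            last_key = key.strip()
            current[last_key] = value.strip()
        elif last_key is not None:
            # No '=' on this line: treat it as a wrapped value.
            current[last_key] = (current[last_key] + " " + line).strip()
    if current:
        records.append(current)
    return records


def watchdog_events(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only watchdog-expiry entries (Message ID ASR0000 in the thread)."""
    return [r for r in records if r.get("Message ID") == "ASR0000"]


if __name__ == "__main__":
    for event in watchdog_events(parse_lclog(SAMPLE)):
        print(event["Timestamp"], event["Severity"], event["Message"])
```

Collecting the timestamps this way makes it easier to line them up against the soft-lockup times on each MDS host; in the thread, the only two watchdog entries date from March, well before the recent lockups.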